Covariate Inference

Updated: 2025-01-06

library(AllelicSeries)

Overview

The primary goal of allelic series analysis is to determine whether a dose-response relationship exists between the functionality of a gene and the phenotype of interest. A secondary goal might be to quantify the association between covariates and the phenotype in the context of an allelic series model. This vignette will illustrate how to perform inference on the latter relationship under two paradigms: the score test setting and Wald test setting. The score test setting estimates the association between the phenotype and covariates under the null hypothesis of no genotypic effect. An advantage of score testing is that all genes and all association models employed by COAST share a common null model. Thus, the relationship between the phenotype and covariates is characterized by a single set of association parameters. Even if the null hypothesis does not hold for all genes, score testing may be a reasonable approximation when the effect of genotype on the phenotype is relatively small. By contrast, the Wald test setting estimates the association between the phenotype and covariates while allowing for a non-zero genotypic effect. Although more flexible, a drawback of Wald testing is the need to estimate a separate set of association parameters for each gene and each association model employed by COAST.

Example data

The simulated data provided by the DGP function includes a covariate data matrix covar with columns representing an intercept int, age, sex, and 3 genetic principal components pc1-pc3.

set.seed(101)
data <- AllelicSeries::DGP(n = 1e3)
head(data$covar)
#>      int          age sex        pc1        pc2         pc3
#> [1,]   1  0.003229867   1 -0.4170603  0.6779313  0.15930142
#> [2,]   1 -0.038306667   1 -1.1925043  0.7518853  0.83471897
#> [3,]   1 -0.399554232   1 -1.1574644 -0.2112389 -0.27238536
#> [4,]   1 -0.905447329   0 -0.2382042 -0.3219297  0.83435277
#> [5,]   1  0.637206113   1  1.0682585 -0.3522423  1.38841591
#> [6,]   1 -1.418399109   0  0.1814291  0.4061925  0.06444786

Score tests

Score tests estimate the association between the phenotype and covariates in the absence of a genotypic effect. The effect sizes can be estimated by simple linear regression of the phenotype on covariates in the case of a continuous phenotype, or logistic regression in the case of a binary phenotype. Since all genes and all component allelic series tests have a common null model, a single set of coefficients is sufficient to characterize the association between the phenotype and covariates.

# Format score test data.frame.
df <- data.frame(data$covar)
df$y <- data$pheno

# Case of a continuous phenotype.
# An intercept is omitted from the call to `lm` because one is already 
# contained in the covariate matrix.
fit <- lm(y ~ 0 + ., data = df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ 0 + ., data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.2251 -1.1145 -0.3202  0.8159  7.1627 
#> 
#> Coefficients:
#>     Estimate Std. Error t value Pr(>|t|)    
#> int  0.95172    0.07279  13.075  < 2e-16 ***
#> age -0.07020    0.04993  -1.406   0.1600    
#> sex  0.14129    0.10026   1.409   0.1591    
#> pc1  0.04771    0.05007   0.953   0.3409    
#> pc2  0.14178    0.04992   2.840   0.0046 ** 
#> pc3  0.20038    0.04996   4.011  6.5e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.576 on 994 degrees of freedom
#> Multiple R-squared:  0.3135, Adjusted R-squared:  0.3094 
#> F-statistic: 75.67 on 6 and 994 DF,  p-value: < 2.2e-16

The allelic series SKAT test is inherently a score test. The allelic series burden tests may be run as score tests by setting the score_test option to TRUE, as in the following:

results <- AllelicSeries::COAST(
  anno = data$anno,
  geno = data$geno,
  pheno = data$pheno,
  covar = data$covar,
  score_test = TRUE
)

Wald tests

Wald tests estimate the association between the phenotype and covariates allowing for the presence of a genotypic effect. We will focus on estimation of effect sizes for the allelic series burden models. The key difference from score testing is the need to calculate the gene-burden score. This can be achieved using the Aggregator function. The necessary inputs are the annotation vector anno and genotype matrix geno. The aggregation method argument should be set to "none" for the baseline model (default), "sum" for the allelic sum model, and "max" for the allelic max model. The indicator argument should be set to FALSE for additive genotype encoding (default), and TRUE for dominance genotype encoding.

# Example of fitting the baseline allelic series model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "none")
colnames(g) <- c("g1", "g2", "g3")
df_base <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_base)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ 0 + ., data = df_base)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.25334 -0.53567  0.00053  0.49116  2.38950 
#> 
#> Coefficients:
#>     Estimate Std. Error t value Pr(>|t|)    
#> g1   1.03675    0.05152  20.123  < 2e-16 ***
#> g2   1.01910    0.02488  40.962  < 2e-16 ***
#> g3   1.03262    0.03324  31.067  < 2e-16 ***
#> int  0.06103    0.03912   1.560 0.119019    
#> age -0.09674    0.02438  -3.967 7.79e-05 ***
#> sex  0.09824    0.04915   1.999 0.045903 *  
#> pc1  0.09550    0.02447   3.903 0.000101 ***
#> pc2  0.15654    0.02440   6.416 2.17e-10 ***
#> pc3  0.26349    0.02447  10.768  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7694 on 991 degrees of freedom
#> Multiple R-squared:  0.837,  Adjusted R-squared:  0.8355 
#> F-statistic: 565.4 on 9 and 991 DF,  p-value: < 2.2e-16

# Example of fitting the allelic series sum model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "sum")
colnames(g) <- c("g_sum")
df_sum <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_sum)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ 0 + ., data = df_sum)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.25447 -0.53172  0.00076  0.49099  2.38847 
#> 
#> Coefficients:
#>       Estimate Std. Error t value Pr(>|t|)    
#> g_sum  1.02571    0.01817  56.463  < 2e-16 ***
#> int    0.06204    0.03883   1.598 0.110432    
#> age   -0.09659    0.02435  -3.967 7.80e-05 ***
#> sex    0.09744    0.04889   1.993 0.046540 *  
#> pc1    0.09534    0.02443   3.903 0.000102 ***
#> pc2    0.15644    0.02434   6.427 2.02e-10 ***
#> pc3    0.26281    0.02438  10.778  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7687 on 993 degrees of freedom
#> Multiple R-squared:  0.837,  Adjusted R-squared:  0.8358 
#> F-statistic: 728.3 on 7 and 993 DF,  p-value: < 2.2e-16

# Example of fitting the allelic series max model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "max")
colnames(g) <- c("g_max")
df_max <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_max)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ 0 + ., data = df_max)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.2191 -0.5904 -0.0163  0.5239  4.3631 
#> 
#> Coefficients:
#>       Estimate Std. Error t value Pr(>|t|)    
#> g_max  1.16910    0.02421  48.283  < 2e-16 ***
#> int    0.05036    0.04396   1.145 0.252332    
#> age   -0.08737    0.02730  -3.200 0.001418 ** 
#> sex    0.09045    0.05484   1.649 0.099378 .  
#> pc1    0.10294    0.02741   3.756 0.000183 ***
#> pc2    0.15302    0.02730   5.605 2.69e-08 ***
#> pc3    0.23371    0.02733   8.552  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8621 on 993 degrees of freedom
#> Multiple R-squared:  0.7949, Adjusted R-squared:  0.7935 
#> F-statistic: 549.9 on 7 and 993 DF,  p-value: < 2.2e-16

Effect sizes for the allelic series SKAT model may be estimated by fitting a (generalized) linear mixed-effects model with an appropriately specified random-effect for genotype. However, fitting such models is uncommon, as the SKAT test is inherently a score test, and developing a strategy for Wald-type estimation of the fixed-effects in a SKAT model is beyond the scope of this vignette.