Updated: 2025-01-06
The primary goal of allelic series analysis is to determine whether a dose-response relationship exists between the functionality of a gene and the phenotype of interest. A secondary goal might be to quantify the association between covariates and the phenotype in the context of an allelic series model. This vignette will illustrate how to perform inference on the latter relationship under two paradigms: the score test setting and Wald test setting. The score test setting estimates the association between the phenotype and covariates under the null hypothesis of no genotypic effect. An advantage of score testing is that all genes and all association models employed by COAST share a common null model. Thus, the relationship between the phenotype and covariates is characterized by a single set of association parameters. Even if the null hypothesis does not hold for all genes, score testing may be a reasonable approximation when the effect of genotype on the phenotype is relatively small. By contrast, the Wald test setting estimates the association between the phenotype and covariates while allowing for a non-zero genotypic effect. Although more flexible, a drawback of Wald testing is the need to estimate a separate set of association parameters for each gene and each association model employed by COAST.
The simulated data provided by the DGP
function includes
a covariate data matrix covar
with columns representing an
intercept int
, age
, sex
, and 3
genetic principal components pc1
-pc3
.
set.seed(101)
data <- AllelicSeries::DGP(n = 1e3)
head(data$covar)
#> int age sex pc1 pc2 pc3
#> [1,] 1 0.003229867 1 -0.4170603 0.6779313 0.15930142
#> [2,] 1 -0.038306667 1 -1.1925043 0.7518853 0.83471897
#> [3,] 1 -0.399554232 1 -1.1574644 -0.2112389 -0.27238536
#> [4,] 1 -0.905447329 0 -0.2382042 -0.3219297 0.83435277
#> [5,] 1 0.637206113 1 1.0682585 -0.3522423 1.38841591
#> [6,] 1 -1.418399109 0 0.1814291 0.4061925 0.06444786
Score tests estimate the association between the phenotype and covariates in the absence of a genotypic effect. The effect sizes can be estimated by simple linear regression of the phenotype on covariates in the case of a continuous phenotype, or logistic regression in the case of a binary phenotype. Since all genes and all component allelic series tests have a common null model, a single set of coefficients is sufficient to characterize the association between the phenotype and covariates.
# Format score test data.frame.
df <- data.frame(data$covar)
df$y <- data$pheno
# Case of a continuous phenotype.
# An intercept is omitted from the call to `lm` because one is already
# contained in the covariate matrix.
fit <- lm(y ~ 0 + ., data = df)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ 0 + ., data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.2251 -1.1145 -0.3202 0.8159 7.1627
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> int 0.95172 0.07279 13.075 < 2e-16 ***
#> age -0.07020 0.04993 -1.406 0.1600
#> sex 0.14129 0.10026 1.409 0.1591
#> pc1 0.04771 0.05007 0.953 0.3409
#> pc2 0.14178 0.04992 2.840 0.0046 **
#> pc3 0.20038 0.04996 4.011 6.5e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.576 on 994 degrees of freedom
#> Multiple R-squared: 0.3135, Adjusted R-squared: 0.3094
#> F-statistic: 75.67 on 6 and 994 DF, p-value: < 2.2e-16
The allelic series SKAT test is inherently a score
test. The allelic series burden tests may be run as
score tests by setting the score_test
option to
TRUE
, as in the following:
Wald tests estimate the association between the phenotype and
covariates allowing for the presence of a genotypic effect. We will
focus on estimation of effect sizes for the allelic series burden
models. The key difference from score testing is the need to calculate
the gene-burden score. This can be achieved using the
Aggregator
function. The necessary inputs are the
annotation vector anno
and genotype matrix
geno
. The aggregation method
argument should
be set to "none"
for the baseline model (default),
"sum"
for the allelic sum model, and "max"
for
the allelic max model. The indicator
argument should be set
to FALSE
for additive genotype encoding (default), and
TRUE
for dominance genotype encoding.
# Example of fitting the baseline allelic series model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "none")
colnames(g) <- c("g1", "g2", "g3")
df_base <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_base)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ 0 + ., data = df_base)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.25334 -0.53567 0.00053 0.49116 2.38950
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> g1 1.03675 0.05152 20.123 < 2e-16 ***
#> g2 1.01910 0.02488 40.962 < 2e-16 ***
#> g3 1.03262 0.03324 31.067 < 2e-16 ***
#> int 0.06103 0.03912 1.560 0.119019
#> age -0.09674 0.02438 -3.967 7.79e-05 ***
#> sex 0.09824 0.04915 1.999 0.045903 *
#> pc1 0.09550 0.02447 3.903 0.000101 ***
#> pc2 0.15654 0.02440 6.416 2.17e-10 ***
#> pc3 0.26349 0.02447 10.768 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.7694 on 991 degrees of freedom
#> Multiple R-squared: 0.837, Adjusted R-squared: 0.8355
#> F-statistic: 565.4 on 9 and 991 DF, p-value: < 2.2e-16
# Example of fitting the allelic series sum model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "sum")
colnames(g) <- c("g_sum")
df_sum <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_sum)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ 0 + ., data = df_sum)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.25447 -0.53172 0.00076 0.49099 2.38847
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> g_sum 1.02571 0.01817 56.463 < 2e-16 ***
#> int 0.06204 0.03883 1.598 0.110432
#> age -0.09659 0.02435 -3.967 7.80e-05 ***
#> sex 0.09744 0.04889 1.993 0.046540 *
#> pc1 0.09534 0.02443 3.903 0.000102 ***
#> pc2 0.15644 0.02434 6.427 2.02e-10 ***
#> pc3 0.26281 0.02438 10.778 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.7687 on 993 degrees of freedom
#> Multiple R-squared: 0.837, Adjusted R-squared: 0.8358
#> F-statistic: 728.3 on 7 and 993 DF, p-value: < 2.2e-16
# Example of fitting the allelic series max model.
g <- Aggregator(anno = data$anno, geno = data$geno, method = "max")
colnames(g) <- c("g_max")
df_max <- cbind(data.frame(g), df)
fit <- lm(y ~ 0 + ., data = df_max)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ 0 + ., data = df_max)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.2191 -0.5904 -0.0163 0.5239 4.3631
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> g_max 1.16910 0.02421 48.283 < 2e-16 ***
#> int 0.05036 0.04396 1.145 0.252332
#> age -0.08737 0.02730 -3.200 0.001418 **
#> sex 0.09045 0.05484 1.649 0.099378 .
#> pc1 0.10294 0.02741 3.756 0.000183 ***
#> pc2 0.15302 0.02730 5.605 2.69e-08 ***
#> pc3 0.23371 0.02733 8.552 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8621 on 993 degrees of freedom
#> Multiple R-squared: 0.7949, Adjusted R-squared: 0.7935
#> F-statistic: 549.9 on 7 and 993 DF, p-value: < 2.2e-16
Effect sizes for the allelic series SKAT model may be estimated by fitting a (generalized) linear mixed-effects model with an appropriately specified random-effect for genotype. However, fitting such models is uncommon, as the SKAT test is inherently a score test, and developing a strategy for Wald-type estimation of the fixed-effects in a SKAT model is beyond the scope of this vignette.