The ZINQ package (v2.0) contains functions that conduct association testing between individual taxa of microbiome and clinical variable(s). The microbiome data can be unnormalized or normalized by any existing method, such as rarefaction, TSS, CSS, etc.. The clinical variable(s) can be a single binary, multi-class discrete or continuous variable, or a set of variables consisting of any type of the aformentioned variables. Both unadjusted and adjusted tests can be implemented.
In addition, the ZINQ package (v2.0) updates the logistic component to Firth logistic regression, in order to mitigate the bias when the presence-absence status of taxa is imbalanced across the different groups of the interested clinical variable(s).
The following packages are required for functions and examples in the ZINQ package: quantreg, MASS, logistf, all are available on CRAN.
library(ZINQ)
It is recommended to do sanity check using ZINQ_check before applying ZINQ. The inputs are the un-normalized taxa read count table and metadata. The function will print warnings when
Type warnings() after the sanity check to read the warnings. The warnings are recommendations for applying ZINQ to the data.
The sanity check is mainly about zero inflation. Most normalization methods will keep the original zeroes, thus investigating the un-normalized taxa read count table provides sufficient clues to use ZINQ. ZINQ will also return singularity errors when too many covariates are specified, as in other regression frameworks. For more details, refer to the .pdf manual.
We will use the sample data in the package to demonstrate the ZINQ test. The data contains normalized abundance of two taxa: rarefied abundance of taxon 1 and CSS normalized abundance of taxon 2. Also, it contains a binary clinical variable, which is of interest, and three continuous covariates for adjustment. In total, there are 531 subject. We want to determine whether the two taxa are associated with the binary clinical variable.
data(Sample_Data)
summary(Sample_Data)
#> rarefied_taxon1 CSS_taxon2 X Z1 Z2 Z3
#> Min. : 0.00 Min. : 0.000 0:346 Min. : 0.0 Min. :48.00 Min. :-16.000
#> 1st Qu.: 0.00 1st Qu.: 1.614 1:185 1st Qu.: 127.0 1st Qu.:53.00 1st Qu.: -1.000
#> Median : 4.00 Median : 4.766 Median : 277.0 Median :56.00 Median : 3.000
#> Mean : 31.69 Mean : 4.691 Mean : 348.5 Mean :55.35 Mean : 2.684
#> 3rd Qu.: 32.00 3rd Qu.: 7.350 3rd Qu.: 512.0 3rd Qu.:58.00 3rd Qu.: 7.000
#> Max. :619.00 Max. :14.189 Max. :1598.0 Max. :61.00 Max. : 21.000
We first use ZINQ_tests to conduct marginal tests in the Firth logistic and quantile regression components, for each of the two taxa.
= Sample_Data[, -c(1:2)]
covariates
= vector(mode = "list", length = 2)
result
= cbind(Y=Sample_Data[, 1], covariates)
dat 1]] = ZINQ_tests(formula.logistic=Y~X+Z1+Z2+Z3, formula.quantile=Y~X+Z1+Z2+Z3, C="X", y_CorD = "D", data=dat)
result[[
= cbind(Y=Sample_Data[, 2], covariates)
dat 2]] = ZINQ_tests(formula.logistic=Y~X+Z1+Z2+Z3, formula.quantile=Y~X+Z1+Z2+Z3, C="X", data=dat) result[[
Next, we use the output from ZINQ_tests as the input of ZINQ_combination to obtain the final p-values.
ZINQ_combination(result[[1]], method="Cauchy", taus=c(0.25, 0.5, 0.75))
#> [1] 0.05937489
ZINQ_combination(result[[2]], method="MinP")
#> [1] 0.004742869
ZINQ can detect higher-order associations beyond the simple mean association. The two taxa in the sample data have typical abundance profiles that highlight the power of ZINQ. Their stratified quantile functions according to the two conditions of the clinical variable form a spindle shape or cross with each other. We will use linear regression to show the inadequacy of mean-based methods for the two cases.
= seq(0.01, 0.99, by=0.01)
taus
par(mfrow=c(1,2))
for (ii in 1:2){
= which(Sample_Data$X == 1)
id1 = which(Sample_Data$X == 0)
id0
= Sample_Data[id1, ii]
abundance1 = Sample_Data[id0, ii]
abundance0
= quantile(abundance1, taus)
q1 = quantile(abundance0, taus)
q0
plot(taus, q1, type="l", main=names(Sample_Data)[ii], ylab="quantile", xlab="", ylim=c(0, max(q1, q0)))
mtext(text=expression(tau), side=1, cex=1.5, line=3)
lines(taus, q0, col=2)
abline(h=mean(abundance1), lty=2)
abline(h=mean(abundance0), lty=2, col=2)
legend('topleft', c('quantile X=1', 'mean X=1', 'quantile X=0', 'mean X=0'), col=c(1, 1, 2, 2), lty=c(1, 2, 1, 2), bty='n')
}
= cbind(Y=Sample_Data[, 1], covariates)
dat summary(lm(Y~X+Z1+Z2+Z3, data=dat))
#>
#> Call:
#> lm(formula = Y ~ X + Z1 + Z2 + Z3, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -60.63 -31.36 -22.19 3.31 568.36
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7.06450 47.27603 -0.149 0.8813
#> X1 -5.93787 6.30699 -0.941 0.3469
#> Z1 0.01781 0.01024 1.739 0.0826 .
#> Z2 0.57651 0.85418 0.675 0.5000
#> Z3 1.00943 0.52066 1.939 0.0531 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 66.59 on 526 degrees of freedom
#> Multiple R-squared: 0.02172, Adjusted R-squared: 0.01428
#> F-statistic: 2.92 on 4 and 526 DF, p-value: 0.02083
= cbind(Y=Sample_Data[, 2], covariates)
dat summary(lm(Y~X+Z1+Z2+Z3, data=dat))
#>
#> Call:
#> lm(formula = Y ~ X + Z1 + Z2 + Z3, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -6.6473 -3.4132 -0.0458 2.5284 10.3187
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 7.7835857 2.5665225 3.033 0.00254 **
#> X1 0.2794339 0.3423942 0.816 0.41480
#> Z1 0.0011254 0.0005559 2.024 0.04343 *
#> Z2 -0.0671933 0.0463716 -1.449 0.14793
#> Z3 0.0511081 0.0282654 1.808 0.07115 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.615 on 526 degrees of freedom
#> Multiple R-squared: 0.01916, Adjusted R-squared: 0.0117
#> F-statistic: 2.568 on 4 and 526 DF, p-value: 0.0373