Overview

The ZINQ package (v2.0) contains functions that conduct association testing between individual taxa of microbiome and clinical variable(s). The microbiome data can be unnormalized or normalized by any existing method, such as rarefaction, TSS, CSS, etc.. The clinical variable(s) can be a single binary, multi-class discrete or continuous variable, or a set of variables consisting of any type of the aformentioned variables. Both unadjusted and adjusted tests can be implemented.

In addition, the ZINQ package (v2.0) updates the logistic component to Firth logistic regression, in order to mitigate the bias when the presence-absence status of taxa is imbalanced across the different groups of the interested clinical variable(s).

The following packages are required for functions and examples in the ZINQ package: quantreg, MASS, logistf, all are available on CRAN.

library(ZINQ)

Implementation of ZINQ test

Sanity check

It is recommended to do sanity check using ZINQ_check before applying ZINQ. The inputs are the un-normalized taxa read count table and metadata. The function will print warnings when

Library size is a confounder of the variable(s) of interest. ZINQ might not properly control type I error, since it treats zeroes and non-zeroes differently.
For each taxon:
- All read counts are zero. ZINQ will return a singularity error.
- There are limited non-zero read counts (<30 or <15). Recommended settings of ZINQ will be provided.
- There is a perfect separation, e.g., there are all zeroes in one group (case or control), which is an extreme case of imbalance. Use ZINQ with caution.

Type warnings() after the sanity check to read the warnings. The warnings are recommendations for applying ZINQ to the data.

The sanity check is mainly about zero inflation. Most normalization methods will keep the original zeroes, thus investigating the un-normalized taxa read count table provides sufficient clues to use ZINQ. ZINQ will also return singularity errors when too many covariates are specified, as in other regression frameworks. For more details, refer to the .pdf manual.

Sample data

We will use the sample data in the package to demonstrate the ZINQ test. The data contains normalized abundance of two taxa: rarefied abundance of taxon 1 and CSS normalized abundance of taxon 2. Also, it contains a binary clinical variable, which is of interest, and three continuous covariates for adjustment. In total, there are 531 subject. We want to determine whether the two taxa are associated with the binary clinical variable.

data(Sample_Data)
summary(Sample_Data)
#>  rarefied_taxon1    CSS_taxon2     X             Z1               Z2              Z3         
#>  Min.   :  0.00   Min.   : 0.000   0:346   Min.   :   0.0   Min.   :48.00   Min.   :-16.000  
#>  1st Qu.:  0.00   1st Qu.: 1.614   1:185   1st Qu.: 127.0   1st Qu.:53.00   1st Qu.: -1.000  
#>  Median :  4.00   Median : 4.766           Median : 277.0   Median :56.00   Median :  3.000  
#>  Mean   : 31.69   Mean   : 4.691           Mean   : 348.5   Mean   :55.35   Mean   :  2.684  
#>  3rd Qu.: 32.00   3rd Qu.: 7.350           3rd Qu.: 512.0   3rd Qu.:58.00   3rd Qu.:  7.000  
#>  Max.   :619.00   Max.   :14.189           Max.   :1598.0   Max.   :61.00   Max.   : 21.000

Use of ZINQ_tests

We first use ZINQ_tests to conduct marginal tests in the Firth logistic and quantile regression components, for each of the two taxa.

covariates = Sample_Data[, -c(1:2)]

result = vector(mode = "list", length = 2)

dat = cbind(Y=Sample_Data[, 1], covariates)
result[[1]] = ZINQ_tests(formula.logistic=Y~X+Z1+Z2+Z3, formula.quantile=Y~X+Z1+Z2+Z3, C="X", y_CorD = "D", data=dat)

dat = cbind(Y=Sample_Data[, 2], covariates)
result[[2]] = ZINQ_tests(formula.logistic=Y~X+Z1+Z2+Z3, formula.quantile=Y~X+Z1+Z2+Z3, C="X", data=dat)

Use of ZINQ_combination

Next, we use the output from ZINQ_tests as the input of ZINQ_combination to obtain the final p-values.

ZINQ_combination(result[[1]], method="Cauchy", taus=c(0.25, 0.5, 0.75))
#> [1] 0.05937489
ZINQ_combination(result[[2]], method="MinP")
#> [1] 0.004742869

Characteristics of ZINQ test

ZINQ can detect higher-order associations beyond the simple mean association. The two taxa in the sample data have typical abundance profiles that highlight the power of ZINQ. Their stratified quantile functions according to the two conditions of the clinical variable form a spindle shape or cross with each other. We will use linear regression to show the inadequacy of mean-based methods for the two cases.

taus= seq(0.01, 0.99, by=0.01)

par(mfrow=c(1,2))

for (ii in 1:2){
  id1 = which(Sample_Data$X == 1)
  id0 = which(Sample_Data$X == 0)

  abundance1 = Sample_Data[id1, ii]
  abundance0 = Sample_Data[id0, ii]

  q1 = quantile(abundance1, taus)
  q0 = quantile(abundance0, taus)

  plot(taus, q1, type="l", main=names(Sample_Data)[ii], ylab="quantile", xlab="", ylim=c(0, max(q1, q0)))
  mtext(text=expression(tau), side=1, cex=1.5, line=3)
  lines(taus, q0, col=2)

  abline(h=mean(abundance1), lty=2)
  abline(h=mean(abundance0), lty=2, col=2)

  legend('topleft', c('quantile X=1', 'mean X=1', 'quantile X=0', 'mean X=0'), col=c(1, 1, 2, 2), lty=c(1, 2, 1, 2), bty='n')
}


dat = cbind(Y=Sample_Data[, 1], covariates)
summary(lm(Y~X+Z1+Z2+Z3, data=dat))
#> 
#> Call:
#> lm(formula = Y ~ X + Z1 + Z2 + Z3, data = dat)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -60.63 -31.36 -22.19   3.31 568.36 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) -7.06450   47.27603  -0.149   0.8813  
#> X1          -5.93787    6.30699  -0.941   0.3469  
#> Z1           0.01781    0.01024   1.739   0.0826 .
#> Z2           0.57651    0.85418   0.675   0.5000  
#> Z3           1.00943    0.52066   1.939   0.0531 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 66.59 on 526 degrees of freedom
#> Multiple R-squared:  0.02172,    Adjusted R-squared:  0.01428 
#> F-statistic:  2.92 on 4 and 526 DF,  p-value: 0.02083

dat = cbind(Y=Sample_Data[, 2], covariates)
summary(lm(Y~X+Z1+Z2+Z3, data=dat))
#> 
#> Call:
#> lm(formula = Y ~ X + Z1 + Z2 + Z3, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -6.6473 -3.4132 -0.0458  2.5284 10.3187 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  7.7835857  2.5665225   3.033  0.00254 **
#> X1           0.2794339  0.3423942   0.816  0.41480   
#> Z1           0.0011254  0.0005559   2.024  0.04343 * 
#> Z2          -0.0671933  0.0463716  -1.449  0.14793   
#> Z3           0.0511081  0.0282654   1.808  0.07115 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.615 on 526 degrees of freedom
#> Multiple R-squared:  0.01916,    Adjusted R-squared:  0.0117 
#> F-statistic: 2.568 on 4 and 526 DF,  p-value: 0.0373

ZINQ Package Vignette