The ConQuR package (v2.0) contains functions that conduct batch effects removal from a taxa read count table by a conditional quantile regression method. The distributional attributes of microbiome data - zero-inflation and over-dispersion, are simultaneously considered. To achieve the maximum removal of batch effects, a function tuning among the variations of ConQuR is provided. Supporting functions to compute PERMANOVA R\(^2\), print PCoA plot and predict key variables based on a taxa read count table are also provided.
The following packages are required for functions and examples in the ZINQ package: quantreg, cqrReg, glmnet, dplyr, doParallel, gplots, vegan, ade4, compositions, randomForest, ROCR, ape, GUniFrac, fastDummies.
Note: Due to technical issues, always library doParallel together with ConQuR.
We will use the sample data in the package to demonstrate how to use ConQuR. The dataset contains 100 taxa from 273 samples, which were sequenced in 3 batches. There are batchid (factor), and the metadata: sbp (systolic blood pressure, the key variable, continuous), sex (covariate 1, binary), race (covariate 2, binary) and age (covariate 3, continuous).
Note:
data(Sample_Data)
taxa = Sample_Data[, 1:100]
taxa[146:150, 1:5]
#> Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276 0 0 0 0 0
#> 277 0 1 0 0 0
#> 278 0 0 0 0 0
#> 279 0 0 0 3 0
#> 280 0 0 0 0 0
batchid = Sample_Data[, 'batchid']
summary(batchid)
#> 0 1 2
#> 94 86 93
covar = Sample_Data[, c('sex', 'race', 'age', 'sbp')]
summary(covar)
#> sex race age sbp
#> 0:126 0:123 Min. :48.00 Min. : 80.54
#> 1:147 1:150 1st Qu.:53.00 1st Qu.:107.42
#> Median :56.00 Median :119.90
#> Mean :55.18 Mean :119.52
#> 3rd Qu.:58.00 3rd Qu.:129.50
#> Max. :60.00 Max. :175.58
We first use ConQuR (default fitting strategy) to remove batch effects from the taxa read count table, with the randomly picked reference batch: Batch 0. For the default, standard logistic regression and standard quantile regression are used, interpolation of the piece-wise estimation is not used when estimating the conditional quantile functions (both original and batch-free) of the taxonmic read count for each sample and each taxon.
Note: option(warn = -1) is used in the vignette to suppress a benign warning from quantile regression indicating the solution is not unique. Since quantile regression is conducted many times throughout the vignette, it becomes unwieldy and difficult to read if the warnings are not suppressed.
options(warn=-1)
taxa_corrected1 = ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar, batch_ref="0")
taxa_corrected1[146:150, 1:5]
#> Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276 0 1 0 0 0
#> 277 0 4 0 0 0
#> 278 0 1 0 0 0
#> 279 0 1 0 1 0
#> 280 0 1 0 0 0
We then apply ConQuR with the penalized fitting strategy: logistic LASSO regression (with the optimal \(\lambda^L\) chosen via cross-validation) and quantile LASSO regression (with the default \(\lambda^Q = \frac{2p}{n}\), can be changed to \(\lambda^Q = \frac{2p}{log(n)}\) by setting lambda_quantile=“2p/logn”) are used, interpolation of the piece-wise estimation is used. The penalized fitting strategy tackles the high-dimensional covariate problem, helps to stablize the estimates, and prevents over-correction (the case that the key variable’s effects are also largely eliminated).
options(warn=-1)
taxa_corrected2 = ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar, batch_ref="0",
logistic_lasso=T, quantile_type="lasso", interplt=T)
taxa_corrected2[146:150, 1:5]
#> Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276 0 1 0 0 0
#> 277 0 3 0 0 0
#> 278 0 0 0 0 0
#> 279 0 0 0 0 0
#> 280 0 1 0 0 0
Next, we compare the corrected taxa read count table to the original one, checking whether the batch effects are eliminated and the key variable’s effects are preserved.
par(mfrow=c(2, 3))
Plot_PCoA(TAX=taxa, factor=batchid, main="Before Correction, Bray-Curtis")
Plot_PCoA(TAX=taxa_corrected1, factor=batchid, main="ConQuR (Default), Bray-Curtis")
Plot_PCoA(TAX=taxa_corrected2, factor=batchid, main="ConQuR (Penalized), Bray-Curtis")
Plot_PCoA(TAX=taxa, factor=batchid, dissimilarity="Aitch", main="Before Correction, Aitchison")
Plot_PCoA(TAX=taxa_corrected1, factor=batchid, dissimilarity="Aitch", main="ConQuR (Default), Aitchison")
Plot_PCoA(TAX=taxa_corrected2, factor=batchid, dissimilarity="Aitch", main="ConQuR (Penalized), Aitchison")
In the original taxa read count table, with the key variable sbp be the 4th variable in covar:
PERMANOVA_R2(TAX=taxa, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#> standard sqrt.dist=T add=T
#> batch 0.039984533 0.025595993 0.015353159
#> key 0.009755741 0.007050036 0.005166912
#>
#> $tab_rel
#> standard sqrt.dist=T add=T
#> batch 0.03505098 0.022282420 0.03505098
#> key 0.00997297 0.006697091 0.00997297
In the corrected taxa read count table, by ConQuR (default):
PERMANOVA_R2(TAX=taxa_corrected1, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#> standard sqrt.dist=T add=T
#> batch 0.0003432112 0.003822679 0.005507133
#> key 0.0103287103 0.007363574 0.005428144
#>
#> $tab_rel
#> standard sqrt.dist=T add=T
#> batch 0.002762311 0.005337159 0.002762311
#> key 0.010838443 0.007065430 0.010838443
In the corrected taxa read count table, by ConQuR (penalzed):
PERMANOVA_R2(TAX=taxa_corrected2, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#> standard sqrt.dist=T add=T
#> batch 0.0004604539 0.003863836 0.005535712
#> key 0.0105355709 0.007436608 0.005484897
#>
#> $tab_rel
#> standard sqrt.dist=T add=T
#> batch 0.004873824 0.006411507 0.004873824
#> key 0.013822662 0.008467074 0.013822662
We see that in both Bray-Curtis dissimilarity on the taxonomic read count or Aitchison dissimilarity on the corresponding relative abundance, by standard PERMANOVA or with modification (euclidifies dissimilarities or add a constant to the non-diagonal dissimilarities such that all eigenvalues are non-negative in the underlying Principal Co-ordinates Analysis), batch variability is always reduced while the SBP variability is preserved.
Whereas PERMANOVA R\(^2\) reflects variability in the taxonomic read counts explained by batch and the key variable, the prediction accuracy reflects the proportion of the key variable explained by the taxonomic read counts.
sbp = covar[, 'sbp']
taxa_result = list(taxa, taxa_corrected1, taxa_corrected2)
pred_rmse = matrix(ncol=3, nrow=5)
colnames(pred_rmse) = c("Original", "ConQuR (Default)", "ConQuR (Penalized)")
for (ii in 1:3){
pred_rmse[, ii] = RF_Pred_Regression(TAX=taxa_result[[ii]], variable=sbp)$rmse_across_fold
}
par(mfrow=c(1,1))
boxplot(pred_rmse, main="RMSE of Predicting SBP")
We see that ConQuR can systematically improve the prediction accuracy of SBP from the taxonomic read counts.
Overall, the fune-tuned result of ConQuR is recommended. We have the following options:
Below, we do not exhaust the options for a fast computation. We can check the selected fitting strategy for each cutoff:
result_tuned = Tune_ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar,
batch_ref_pool=c("0", "1"),
logistic_lasso_pool=F,
quantile_type_pool=c("standard", "lasso"),
simple_match_pool=F,
lambda_quantile_pool=c(NA, "2p/n"),
interplt_pool=F,
frequencyL=0,
frequencyU=1)
result_tuned$method_final
#> 0-0.1 0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6 0.6-0.7 0.7-0.8 0.8-0.9 0.9-1
#> batch_ref NA "0" NA "0" "0" "0" "0" "0" "0" "0"
#> original "TRUE" "FALSE" "TRUE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
#> simple_match NA "FALSE" NA "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
#> logistic_lasso NA "FALSE" NA "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
#> quantile_type NA "lasso" NA "lasso" "lasso" "standard" "lasso" "lasso" "lasso" "lasso"
#> lambda NA "2p/n" NA "2p/n" "2p/n" "NA" "2p/n" "2p/n" "2p/n" "2p/n"
#> interplt NA "FALSE" NA "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
Check the fine-tuned correcte taxa read count table:
The use of ConQuR-libsize is very similar to that of ConQuR, thus is omitted.