Implementation of ConQuR

Sample data

We will use the sample data in the package to demonstrate how to use ConQuR. The dataset contains 100 taxa from 273 samples, which were sequenced in 3 batches. There are batchid (factor), and the metadata: sbp (systolic blood pressure, the key variable, continuous), sex (covariate 1, binary), race (covariate 2, binary) and age (covariate 3, continuous).

Note:

taxa only contains taxonomic read counts, which is a n (sample) by p (taxa) matrix.
At this step, always do factor() for discrete variables, and do droplevels() to drop unused factor levels.

data(Sample_Data)

taxa = Sample_Data[, 1:100]
taxa[146:150, 1:5]
#>     Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276           0              0               0             0            0
#> 277           0              1               0             0            0
#> 278           0              0               0             0            0
#> 279           0              0               0             3            0
#> 280           0              0               0             0            0
batchid = Sample_Data[, 'batchid']
summary(batchid)
#>  0  1  2 
#> 94 86 93
covar = Sample_Data[, c('sex', 'race', 'age', 'sbp')]
summary(covar)
#>  sex     race         age             sbp        
#>  0:126   0:123   Min.   :48.00   Min.   : 80.54  
#>  1:147   1:150   1st Qu.:53.00   1st Qu.:107.42  
#>                  Median :56.00   Median :119.90  
#>                  Mean   :55.18   Mean   :119.52  
#>                  3rd Qu.:58.00   3rd Qu.:129.50  
#>                  Max.   :60.00   Max.   :175.58

Use of ConQuR

We first use ConQuR (default fitting strategy) to remove batch effects from the taxa read count table, with the randomly picked reference batch: Batch 0. For the default, standard logistic regression and standard quantile regression are used, interpolation of the piece-wise estimation is not used when estimating the conditional quantile functions (both original and batch-free) of the taxonmic read count for each sample and each taxon.

Note: option(warn = -1) is used in the vignette to suppress a benign warning from quantile regression indicating the solution is not unique. Since quantile regression is conducted many times throughout the vignette, it becomes unwieldy and difficult to read if the warnings are not suppressed.

options(warn=-1)
taxa_corrected1 = ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar, batch_ref="0")
taxa_corrected1[146:150, 1:5]
#>     Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276           0              1               0             0            0
#> 277           0              4               0             0            0
#> 278           0              1               0             0            0
#> 279           0              1               0             1            0
#> 280           0              1               0             0            0

We then apply ConQuR with the penalized fitting strategy: logistic LASSO regression (with the optimal \(\lambda^L\) chosen via cross-validation) and quantile LASSO regression (with the default \(\lambda^Q = \frac{2p}{n}\), can be changed to \(\lambda^Q = \frac{2p}{log(n)}\) by setting lambda_quantile=“2p/logn”) are used, interpolation of the piece-wise estimation is used. The penalized fitting strategy tackles the high-dimensional covariate problem, helps to stablize the estimates, and prevents over-correction (the case that the key variable’s effects are also largely eliminated).

options(warn=-1)
taxa_corrected2 = ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar, batch_ref="0",
                         logistic_lasso=T, quantile_type="lasso", interplt=T)
taxa_corrected2[146:150, 1:5]
#>     Abiotrophia Acetitomaculum Acidaminobacter Actinobaculum Actinotignum
#> 276           0              1               0             0            0
#> 277           0              3               0             0            0
#> 278           0              0               0             0            0
#> 279           0              0               0             0            0
#> 280           0              1               0             0            0

Check how ConQuR works

Next, we compare the corrected taxa read count table to the original one, checking whether the batch effects are eliminated and the key variable’s effects are preserved.

Whether the PCoA plot is more unified w.r.t. batchid

par(mfrow=c(2, 3))

Plot_PCoA(TAX=taxa, factor=batchid, main="Before Correction, Bray-Curtis")
Plot_PCoA(TAX=taxa_corrected1, factor=batchid, main="ConQuR (Default), Bray-Curtis")
Plot_PCoA(TAX=taxa_corrected2, factor=batchid, main="ConQuR (Penalized), Bray-Curtis")

Plot_PCoA(TAX=taxa, factor=batchid, dissimilarity="Aitch", main="Before Correction, Aitchison")
Plot_PCoA(TAX=taxa_corrected1, factor=batchid, dissimilarity="Aitch", main="ConQuR (Default), Aitchison")
Plot_PCoA(TAX=taxa_corrected2, factor=batchid, dissimilarity="Aitch", main="ConQuR (Penalized), Aitchison")

Whether the variability explained by batchid (quantified by PERMANOVA R\(^2\)) is decreased and that explained by the key variable is preserved.

In the original taxa read count table, with the key variable sbp be the 4th variable in covar:

PERMANOVA_R2(TAX=taxa, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#>          standard sqrt.dist=T       add=T
#> batch 0.039984533 0.025595993 0.015353159
#> key   0.009755741 0.007050036 0.005166912
#> 
#> $tab_rel
#>         standard sqrt.dist=T      add=T
#> batch 0.03505098 0.022282420 0.03505098
#> key   0.00997297 0.006697091 0.00997297

In the corrected taxa read count table, by ConQuR (default):

PERMANOVA_R2(TAX=taxa_corrected1, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#>           standard sqrt.dist=T       add=T
#> batch 0.0003432112 0.003822679 0.005507133
#> key   0.0103287103 0.007363574 0.005428144
#> 
#> $tab_rel
#>          standard sqrt.dist=T       add=T
#> batch 0.002762311 0.005337159 0.002762311
#> key   0.010838443 0.007065430 0.010838443

In the corrected taxa read count table, by ConQuR (penalzed):

PERMANOVA_R2(TAX=taxa_corrected2, batchid=batchid, covariates=covar, key_index=4)
#> $tab_count
#>           standard sqrt.dist=T       add=T
#> batch 0.0004604539 0.003863836 0.005535712
#> key   0.0105355709 0.007436608 0.005484897
#> 
#> $tab_rel
#>          standard sqrt.dist=T       add=T
#> batch 0.004873824 0.006411507 0.004873824
#> key   0.013822662 0.008467074 0.013822662

We see that in both Bray-Curtis dissimilarity on the taxonomic read count or Aitchison dissimilarity on the corresponding relative abundance, by standard PERMANOVA or with modification (euclidifies dissimilarities or add a constant to the non-diagonal dissimilarities such that all eigenvalues are non-negative in the underlying Principal Co-ordinates Analysis), batch variability is always reduced while the SBP variability is preserved.

Whether the corrected taxa read count table can better predict the key variable

Whereas PERMANOVA R\(^2\) reflects variability in the taxonomic read counts explained by batch and the key variable, the prediction accuracy reflects the proportion of the key variable explained by the taxonomic read counts.

sbp = covar[, 'sbp']
taxa_result = list(taxa, taxa_corrected1, taxa_corrected2)

pred_rmse = matrix(ncol=3, nrow=5)
colnames(pred_rmse) = c("Original", "ConQuR (Default)", "ConQuR (Penalized)")

for (ii in 1:3){
  pred_rmse[, ii] = RF_Pred_Regression(TAX=taxa_result[[ii]], variable=sbp)$rmse_across_fold
}

par(mfrow=c(1,1))
boxplot(pred_rmse, main="RMSE of Predicting SBP")

We see that ConQuR can systematically improve the prediction accuracy of SBP from the taxonomic read counts.

Use of Tune_ConQuR

Overall, the fune-tuned result of ConQuR is recommended. We have the following options:

batch_ref_pool: candidates of the reference batch, e.g., c(“0”, “1”).
logistic_lasso_pool: whether use logistic LASSO regression, e.g., c(T, F).
quantile_type_pool: types of quantile regression, e.g., c(“standard”, “lasso”, “composite”).
simple_match_pool: whether use the simple quantile-quantile matching, e.g., c(T, F).
lambda_quantile_pool: candidates of the penalization parameter in quantile LASSO or composite quantile regression, e.g., c(NA, “2p/n”, “2p/logn”). Note: always include NA is “standard” is included in quantile_type_pool.
interplt_pool: whether use interpolation in the piece-wise estimation, e.g., c(T, F).
frequencyL: lower bound of taxon frequency that need fine-tuning, e.g., 0.
frequencyU: upper bound of taxon frequency that need fine-tuning, e.g., 1.
cutoff: cutoff of taxon frequency for fine-tuning, e.g., 0.1.

Below, we do not exhaust the options for a fast computation. We can check the selected fitting strategy for each cutoff:

result_tuned = Tune_ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar,
                           batch_ref_pool=c("0", "1"),
                           logistic_lasso_pool=F, 
                           quantile_type_pool=c("standard", "lasso"),
                           simple_match_pool=F,
                           lambda_quantile_pool=c(NA, "2p/n"),
                           interplt_pool=F,
                           frequencyL=0,
                           frequencyU=1)

result_tuned$method_final
#>                0-0.1  0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6    0.6-0.7 0.7-0.8 0.8-0.9 0.9-1  
#> batch_ref      NA     "0"     NA      "0"     "0"     "0"        "0"     "0"     "0"     "0"    
#> original       "TRUE" "FALSE" "TRUE"  "FALSE" "FALSE" "FALSE"    "FALSE" "FALSE" "FALSE" "FALSE"
#> simple_match   NA     "FALSE" NA      "FALSE" "FALSE" "FALSE"    "FALSE" "FALSE" "FALSE" "FALSE"
#> logistic_lasso NA     "FALSE" NA      "FALSE" "FALSE" "FALSE"    "FALSE" "FALSE" "FALSE" "FALSE"
#> quantile_type  NA     "lasso" NA      "lasso" "lasso" "standard" "lasso" "lasso" "lasso" "lasso"
#> lambda         NA     "2p/n"  NA      "2p/n"  "2p/n"  "NA"       "2p/n"  "2p/n"  "2p/n"  "2p/n" 
#> interplt       NA     "FALSE" NA      "FALSE" "FALSE" "FALSE"    "FALSE" "FALSE" "FALSE" "FALSE"

Check the fine-tuned correcte taxa read count table:

taxa_optimal = result_tuned$tax_final

par(mfrow=c(1, 2))
Plot_PCoA(TAX=taxa_optimal, factor=batchid, main="Fine-Tuned ConQuR, Bray-Curtis")
Plot_PCoA(TAX=taxa_optimal, factor=batchid, dissimilarity="Aitch", main="Fine-Tuned ConQuR, Aitchison")

Use of ConQuR-libsize

The use of ConQuR-libsize is very similar to that of ConQuR, thus is omitted.

ConQuR Package Vignette

Wodan Ling and Michael C. Wu

Overview