Re-clustering ASVs
An ASV (Amplicon Sequence Variant; also called ESV, for Exact Sequence Variant) is a DNA sequence obtained from high-throughput analysis of marker genes. An OTU (Operational Taxonomic Unit) is a group of closely related individuals created by clustering sequences based on a similarity threshold. An ASV is a special case of an OTU with a similarity threshold of 100%. A third concept is the zero-radius OTU, or zOTU (Edgar 2016), which is the same concept as an ASV but computed with software other than dada2 (e.g. vsearch).
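To make the role of the similarity threshold concrete, here is a minimal base-R sketch of greedy centroid clustering on toy sequences. The sequences, the helper names `seq_identity()` and `cluster_greedy()`, and the 85% threshold are all hypothetical and chosen for illustration; real tools such as vsearch work on alignments and use far more refined heuristics.

```r
# Toy sequences of equal length (hypothetical data, for illustration only)
seqs <- c(
  s1 = "ACGTACGTAC",
  s2 = "ACGTACGTAT", # one mismatch with s1 (90% identity)
  s3 = "ACGTACGTAC", # identical to s1
  s4 = "TTGTACCTAA"  # distant from the others (60% identity with s1)
)

# Fraction of identical positions between two equal-length sequences
seq_identity <- function(a, b) {
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

# Greedy centroid clustering: each sequence joins the first centroid it
# matches at >= threshold, otherwise it founds a new cluster
cluster_greedy <- function(seqs, threshold) {
  centroids <- character(0)
  membership <- integer(length(seqs))
  for (i in seq_along(seqs)) {
    ids <- vapply(centroids, seq_identity, numeric(1),
      b = seqs[[i]], USE.NAMES = FALSE
    )
    hit <- which(ids >= threshold)
    if (length(hit) == 0) {
      centroids <- c(centroids, seqs[[i]])
      membership[i] <- length(centroids)
    } else {
      membership[i] <- hit[1]
    }
  }
  setNames(membership, names(seqs))
}

# A 100% threshold keeps every distinct sequence apart (ASV-like)
cluster_greedy(seqs, threshold = 1)
# A looser threshold merges the near-identical sequences (OTU-like)
cluster_greedy(seqs, threshold = 0.85)
```

With a threshold of 1, only the two identical sequences share a cluster; at 0.85, the single-mismatch variant joins them and the number of clusters drops.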
The choice between ASVs and OTUs is important because they lead to different results (Joos et al. (2020), Box 2 in Tedersoo et al. (2022), Chiarello et al. (2022)). Most articles recommend making the choice depending on the question. For example, ASVs may be better than OTUs for describing a group of very closely related species. In addition, ASVs are comparable across different datasets (obtained using identical marker genes). On the other hand, Tedersoo et al. (2022) report that ASV approaches overestimate the richness of common fungal species (due to haplotype variation), but underestimate the richness of rare species. They therefore recommend the use of OTUs in metabarcoding analyses of fungal communities. Finally, Kauserud (2023) argues that the ASV term falls within the original OTU term and recommends adopting only the OTU term, together with a concise and clear report on how the OTUs were generated.
Recent articles (Forster et al. 2019; Antich et al. 2021; Brandt et al. 2021) propose using both approaches together. They recommend (i) using ASVs to denoise the dataset and (ii) for some questions, clustering the ASV sequences into OTUs. García-García et al. (2019) used both concepts to demonstrate that ecotypes (ASVs within OTUs) are adapted to different values of environmental factors, which favors the persistence of OTUs across changing environmental conditions. This two-step approach is the goal of the function asv2otu(), which uses either the DECIPHER::Clusterize function from R or the vsearch software.
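Step (ii) above amounts to summing the counts of all ASVs assigned to the same OTU. The following base-R sketch illustrates that aggregation on a hypothetical count table with a hypothetical cluster assignment; it is not the internal code of asv2otu(), only an illustration of why the total number of sequences is unchanged by re-clustering.

```r
# Hypothetical ASV count table: 4 ASVs (rows) x 3 samples (columns)
asv_counts <- matrix(
  c(
    10, 0, 5,
    2, 1, 0,
    0, 7, 3,
    4, 4, 4
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), paste0("sample", 1:3))
)

# Hypothetical result of a clustering step: the OTU of each ASV
otu_of_asv <- c(ASV1 = "OTU1", ASV2 = "OTU1", ASV3 = "OTU2", ASV4 = "OTU3")

# Sum the counts of all ASVs belonging to the same OTU, sample by sample
otu_counts <- rowsum(asv_counts, group = otu_of_asv[rownames(asv_counts)])
otu_counts

# Fewer clusters, same total number of sequences
nrow(asv_counts)
nrow(otu_counts)
sum(asv_counts) == sum(otu_counts)
```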
Using the DECIPHER or vsearch algorithm
data(data_fungi_sp_known)
otu <- asv2otu(data_fungi_sp_known, method = "clusterize")
#> Partitioning sequences by 5-mer similarity:
#> ================================================================================
#> Time difference of 0.15 secs
#>
#> Sorting by relatedness within 1 group:
#> iteration 114 of up to 325 (100.0% stability)
#>
#> Time difference of 5.13 secs
#>
#> Clustering sequences by 8-mer similarity:
#> ================================================================================
#>
#> Time difference of 0.85 secs
#>
#> Clusters via relatedness sorting: 100% (0% exclusively)
#> Clusters via rare 5-mers: 100% (0% exclusively)
#> Estimated clustering effectiveness: 100%
otu_vs <- asv2otu(data_fungi_sp_known, method = "vsearch")
The vsearch method requires vsearch to be installed on your system.
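One simple way to check for the executable before calling the vsearch method is base R's `Sys.which()`. This sketch assumes the binary is named `vsearch` and is expected on the `PATH`; adapt the name or pass an explicit path if your installation differs.

```r
# Look for an executable called "vsearch" on the PATH
# (the executable name is an assumption; adjust it to your installation)
vsearch_path <- Sys.which("vsearch")
vsearch_available <- unname(nzchar(vsearch_path))

if (vsearch_available) {
  message("vsearch found at: ", vsearch_path)
} else {
  message("vsearch not found on the PATH; use method = \"clusterize\" instead")
}
```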
summary_plot_pq(data_fungi_sp_known)
summary_plot_pq(otu)
Using the LULU algorithm
Another post-clustering transformation method is implemented in lulu_pq(), which uses the method of Frøslev et al. (2017) for curation of DNA amplicon data. The aim is more to remove non-biological information than to explicitly reduce the number of clusters. For example, Brandt et al. (2021) clustered amplicon sequence variants (ASVs) into operational taxonomic units (OTUs) with swarm and chose to curate the ASVs/OTUs using LULU.
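The idea behind this kind of curation can be sketched with a deliberately simplified co-occurrence test: a low-abundance "daughter" cluster is flagged as a probable artefact when a more abundant "parent" cluster is present in nearly all of the samples where the daughter occurs. The function name, the abundance vectors and the thresholds below are illustrative assumptions, and the sequence-similarity check that LULU also applies is omitted; this is not the implementation used by lulu_pq().

```r
# Hypothetical abundances of three clusters across 5 samples
parent <- c(120, 80, 0, 40, 200)
daughter <- c(6, 2, 0, 1, 0)
unrelated <- c(0, 50, 0, 0, 10)

# Simplified co-occurrence test (illustrative thresholds, no similarity check)
is_probable_error <- function(parent, daughter,
                              min_cooccurrence = 0.95, min_ratio = 1) {
  present <- daughter > 0
  # Share of the daughter's samples in which the parent also occurs
  cooccurrence <- mean(parent[present] > 0)
  # Smallest parent/daughter abundance ratio over those samples
  ratio <- min(parent[present] / daughter[present])
  cooccurrence >= min_cooccurrence && ratio > min_ratio
}

# The daughter always co-occurs with a much more abundant parent
is_probable_error(parent, daughter)
# The unrelated cluster is absent from most of the daughter's samples
is_probable_error(unrelated, daughter)
```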
lulu_res <- lulu_pq(data_fungi_sp_known)
summary_plot_pq(data_fungi_sp_known)
summary_plot_pq(lulu_res$new_physeq)
Tracking number of samples, sequences and clusters
track_wkflow(list(
"Raw data" = data_fungi_sp_known,
"OTU" = otu,
"OTU_vsearch" = otu_vs,
"LULU" = lulu_res[[1]]
))
#>             nb_sequences nb_clusters nb_samples
#> Raw data         1106581         651        185
#> OTU              1106581         364        185
#> OTU_vsearch      1106581         362        185
#> LULU             1106581         549        185
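To read the table above at a glance, the reduction in the number of clusters can be expressed relative to the raw data. This short sketch re-enters the values printed above by hand (the `track` data frame and the `pct_clusters_removed` column are names introduced here for illustration) rather than reusing the object returned by track_wkflow().

```r
# Values copied from the track_wkflow() output above
track <- data.frame(
  nb_sequences = c(1106581, 1106581, 1106581, 1106581),
  nb_clusters = c(651, 364, 362, 549),
  nb_samples = c(185, 185, 185, 185),
  row.names = c("Raw data", "OTU", "OTU_vsearch", "LULU")
)

# Percentage of clusters removed relative to the raw data
raw_clusters <- track["Raw data", "nb_clusters"]
track$pct_clusters_removed <- round(100 * (1 - track$nb_clusters / raw_clusters), 1)
track
```

Both clustering methods remove roughly 44% of the clusters, LULU removes about 16%, and none of them changes the number of sequences or samples.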
Session information
sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so; LAPACK version 3.9.0
#>
#> locale:
#> [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
#> [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
#> [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Paris
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MiscMetabar_0.8.00 purrr_1.0.2 dplyr_1.1.4 dada2_1.30.0
#> [5] Rcpp_1.0.12 ggplot2_3.5.0 phyloseq_1.46.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.2 bitops_1.0-7
#> [3] pbapply_1.7-2 deldir_2.0-4
#> [5] permute_0.9-7 rlang_1.1.3
#> [7] magrittr_2.0.3 ade4_1.7-22
#> [9] RSQLite_2.3.6 matrixStats_1.3.0
#> [11] compiler_4.3.3 mgcv_1.9-1
#> [13] png_0.1-8 systemfonts_1.0.6
#> [15] vctrs_0.6.5 reshape2_1.4.4
#> [17] stringr_1.5.1 pkgconfig_2.0.3
#> [19] crayon_1.5.2 fastmap_1.1.1
#> [21] XVector_0.42.0 labeling_0.4.3
#> [23] utf8_1.2.4 Rsamtools_2.18.0
#> [25] rmarkdown_2.26 ragg_1.3.0
#> [27] bit_4.0.5 xfun_0.43
#> [29] zlibbioc_1.48.2 cachem_1.0.8
#> [31] GenomeInfoDb_1.38.8 jsonlite_1.8.8
#> [33] biomformat_1.30.0 blob_1.2.4
#> [35] highr_0.10 rhdf5filters_1.14.1
#> [37] DelayedArray_0.28.0 Rhdf5lib_1.24.2
#> [39] BiocParallel_1.36.0 jpeg_0.1-10
#> [41] parallel_4.3.3 cluster_2.1.6
#> [43] R6_2.5.1 bslib_0.7.0
#> [45] stringi_1.8.3 RColorBrewer_1.1-3
#> [47] GenomicRanges_1.54.1 jquerylib_0.1.4
#> [49] SummarizedExperiment_1.32.0 iterators_1.0.14
#> [51] knitr_1.46 DECIPHER_2.30.0
#> [53] IRanges_2.36.0 Matrix_1.6-5
#> [55] splines_4.3.3 igraph_2.0.3
#> [57] tidyselect_1.2.1 rstudioapi_0.16.0
#> [59] abind_1.4-5 yaml_2.3.8
#> [61] vegan_2.6-4 codetools_0.2-19
#> [63] hwriter_1.3.2.1 lattice_0.22-6
#> [65] tibble_3.2.1 plyr_1.8.9
#> [67] Biobase_2.62.0 withr_3.0.0
#> [69] ShortRead_1.60.0 evaluate_0.23
#> [71] desc_1.4.3 survival_3.5-8
#> [73] RcppParallel_5.1.7 Biostrings_2.70.3
#> [75] pillar_1.9.0 MatrixGenerics_1.14.0
#> [77] foreach_1.5.2 stats4_4.3.3
#> [79] generics_0.1.3 RCurl_1.98-1.14
#> [81] S4Vectors_0.40.2 munsell_0.5.1
#> [83] scales_1.3.0 glue_1.7.0
#> [85] tools_4.3.3 interp_1.1-6
#> [87] data.table_1.15.4 GenomicAlignments_1.38.2
#> [89] fs_1.6.3 rhdf5_2.46.1
#> [91] grid_4.3.3 ape_5.8
#> [93] latticeExtra_0.6-30 colorspace_2.1-0
#> [95] nlme_3.1-164 GenomeInfoDbData_1.2.11
#> [97] cli_3.6.2 textshaping_0.3.7
#> [99] fansi_1.0.6 S4Arrays_1.2.1
#> [101] gtable_0.3.4 sass_0.4.9
#> [103] digest_0.6.35 BiocGenerics_0.48.1
#> [105] SparseArray_1.2.4 farver_2.1.1
#> [107] memoise_2.0.1 htmltools_0.5.8.1
#> [109] pkgdown_2.0.7 multtest_2.58.0
#> [111] lifecycle_1.0.4 bit64_4.0.5
#> [113] MASS_7.3-60.0.1