Use the MMseqs2 software to assign taxonomy to sequences.
The preferred usage is to provide a reference FASTA file in SINTAX format
via ref_fasta. The function builds a temporary MMseqs2 taxonomy database
from the SINTAX headers and then runs mmseqs easy-taxonomy with the
requested --lca-mode, giving the same LCA behaviour as the database
path.
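As a rough illustration of what "SINTAX headers" carry, the base R sketch below splits one header into its ranks. The header string and the helper `parse_sintax_header()` are made up for this example; this is not the parsing code used inside `assign_mmseqs2()`.

```r
# Example SINTAX-style FASTA header (made-up record, for illustration only)
header <- paste0(
  "seq1;tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,",
  "o:Agaricales,f:Amanitaceae,g:Amanita,s:Amanita_muscaria;"
)

# Hypothetical helper: split a SINTAX header into a named vector of ranks
parse_sintax_header <- function(header) {
  # Keep only the text after ";tax=" and drop anything from the next ";" on
  tax <- sub("^.*;tax=", "", header)
  tax <- sub(";.*$", "", tax)
  # Each comma-separated field looks like "<rank letter>:<taxon name>"
  fields <- strsplit(tax, ",", fixed = TRUE)[[1]]
  ranks <- sub(":.*$", "", fields)
  values <- sub("^[^:]*:", "", fields)
  stats::setNames(values, ranks)
}

lineage <- parse_sintax_header(header)
lineage[["g"]]
```

Each rank letter (`k`, `p`, `c`, ...) maps to one level of the lineage, which is the information a taxonomy database needs for every reference sequence.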
Alternatively, a pre-built MMseqs2 database with NCBI taxonomy can be
passed via the database parameter (created via mmseqs createdb +
mmseqs createtaxdb, or downloaded with mmseqs databases). In this
case, the MMseqs2 native easy-taxonomy LCA workflow is used. See the
MMseqs2 wiki for details.
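To make the idea of a lowest common ancestor (LCA) concrete, here is a minimal base R sketch on toy data. It implements a plain strict LCA (keep a rank only while all hits agree), not the approximate 2bLCA or top-hit procedures of MMseqs2; the lineages and the helper `naive_lca()` are assumptions for illustration.

```r
# Toy lineages for three hits of a single query (made-up data)
hits <- matrix(
  c(
    "Fungi", "Basidiomycota", "Agaricomycetes", "Amanita",
    "Fungi", "Basidiomycota", "Agaricomycetes", "Russula",
    "Fungi", "Basidiomycota", "Agaricomycetes", "Amanita"
  ),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Kingdom", "Phylum", "Class", "Genus"))
)

# Hypothetical helper: strict LCA over a character matrix of lineages
naive_lca <- function(lineages) {
  # TRUE for each rank where every hit carries the same name
  agree <- apply(lineages, 2, function(x) length(unique(x)) == 1)
  # Keep a rank only if it and all higher ranks agree
  keep <- cumsum(!agree) == 0
  out <- lineages[1, ]
  out[!keep] <- NA_character_
  out
}

lca <- naive_lca(hits)
lca
```

Here the hits disagree at genus level, so the assignment stops at class. The MMseqs2 LCA modes differ in which hits enter this kind of computation and how they are weighted.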
Usage
assign_mmseqs2(
physeq = NULL,
ref_fasta = NULL,
database = NULL,
seq2search = NULL,
mmseqs2path = find_mmseqs2(),
behavior = c("return_matrix", "add_to_phyloseq"),
suffix = "_mmseqs2",
lca_mode = 3,
lca_ranks = c("superkingdom", "phylum", "class", "order", "family", "genus", "species"),
column_names = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"),
search_type = 3,
sensitivity = NULL,
min_seq_id = NULL,
e_value = NULL,
max_accept = 5,
nproc = 1,
clean_pq = TRUE,
simplify_taxo = TRUE,
keep_temporary_files = FALSE,
verbose = FALSE,
cmd_args = ""
)
Arguments
- physeq
(required) a phyloseq-class object obtained using the phyloseq package.
- ref_fasta
Either a Biostrings::DNAStringSet object or a path to a FASTA file in SINTAX format (taxonomy in headers after ;tax=). Only used if database is not set. See assign_sintax() for the SINTAX format specification.
- database
(optional) Path to a pre-built MMseqs2 database with NCBI taxonomy information. Only used if ref_fasta is not set.
- seq2search
(optional) A Biostrings::DNAStringSet object. Use instead of physeq to search arbitrary sequences. Cannot be used together with physeq.
- mmseqs2path
Path to the mmseqs binary (default: find_mmseqs2()).
- behavior
Either "return_matrix" (default) or "add_to_phyloseq":
  - "return_matrix": return a data frame with taxonomic assignments.
  - "add_to_phyloseq": return a phyloseq object with the taxonomy appended to the tax_table slot.
- suffix
(character) Suffix appended to new taxonomy column names (default: "_mmseqs2").
- lca_mode
(integer) The LCA mode used by MMseqs2:
  - 1: single search LCA
  - 3 (default): approximate 2bLCA (fast, recommended)
  - 4: top-hit LCA (all equal-scoring top hits)
- lca_ranks
Character vector of NCBI taxonomy rank names passed to --lca-ranks (default: c("superkingdom", "phylum", "class", "order", "family", "genus", "species")).
- column_names
Character vector of output column names, must be the same length as lca_ranks (default: c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")).
- search_type
(integer) MMseqs2 search type:
  - 0: auto-detect
  - 2: translated nucleotide
  - 3 (default): nucleotide
- sensitivity
(numeric, optional) Search sensitivity (-s parameter). Higher values are slower but more sensitive (range 1–7). If NULL, MMseqs2 uses its default.
- min_seq_id
(numeric, optional) Minimum sequence identity (0–1). If NULL, MMseqs2 uses its default.
- e_value
(numeric, optional) Maximum E-value threshold (-e). If NULL, MMseqs2 uses its default.
- max_accept
(integer, optional) Maximum number of hits accepted per query (--max-accept). Useful with lca_mode = 1 or 4 to widen the hit set used for LCA (default: 5).
- nproc
(integer) Number of threads (default: 1).
- clean_pq
(logical) Clean the phyloseq object before searching? (default: TRUE).
- simplify_taxo
(logical) Apply simplify_taxo() to the result? Only used when behavior = "add_to_phyloseq" (default: TRUE).
- keep_temporary_files
(logical) Keep intermediate files for debugging? (default: FALSE).
- verbose
(logical) Print progress messages? (default: FALSE).
- cmd_args
(character) Additional arguments appended to the MMseqs2 command.
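The sketch below shows one plausible way such arguments could be turned into MMseqs2 command-line flags, with optional parameters left out when they are NULL so that MMseqs2 falls back to its own defaults. The helper `build_mmseqs2_flags()` is hypothetical and is not the internal code of `assign_mmseqs2()`; the mapping of `min_seq_id` to `--min-seq-id` and of `nproc` to `--threads`, as well as the comma separator for `--lca-ranks`, are assumptions that may depend on the MMseqs2 version.

```r
# Hypothetical helper (not part of MiscMetabar): map arguments to MMseqs2 flags
build_mmseqs2_flags <- function(
    lca_mode = 3,
    lca_ranks = c("superkingdom", "phylum", "class", "order",
                  "family", "genus", "species"),
    column_names = c("Kingdom", "Phylum", "Class", "Order",
                     "Family", "Genus", "Species"),
    search_type = 3,
    sensitivity = NULL,
    min_seq_id = NULL,
    e_value = NULL,
    max_accept = 5,
    nproc = 1) {
  # One output column is expected per requested rank
  stopifnot(length(lca_ranks) == length(column_names))

  flags <- c(
    "--lca-mode", lca_mode,
    "--lca-ranks", paste(lca_ranks, collapse = ","), # separator is an assumption
    "--search-type", search_type,
    "--max-accept", max_accept,
    "--threads", nproc
  )
  # Optional parameters are added only when set
  if (!is.null(sensitivity)) flags <- c(flags, "-s", sensitivity)
  if (!is.null(min_seq_id)) flags <- c(flags, "--min-seq-id", min_seq_id)
  if (!is.null(e_value)) flags <- c(flags, "-e", e_value)
  as.character(flags)
}

flags <- build_mmseqs2_flags(sensitivity = 5.7, min_seq_id = 0.8)
paste(flags, collapse = " ")
```

Anything not covered by a dedicated argument would go through cmd_args as a raw string.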
Value
If behavior == "return_matrix": a tibble with columns taxa_names and one column per rank.
If behavior == "add_to_phyloseq": a new phyloseq object with amended tax_table.
Details
This function is mainly a wrapper around the work of others. Please cite MMseqs2: Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics (2021).
Examples
if (FALSE) { # \dontrun{
ref_fasta <- Biostrings::readDNAStringSet(system.file("extdata",
"mini_UNITE_fungi.fasta.gz",
package = "MiscMetabar", mustWork = TRUE
))
# Preferred usage: provide SINTAX-formatted reference sequences.
# The function builds a temporary taxonomy database from the SINTAX
# headers and runs mmseqs easy-taxonomy.
res <- assign_mmseqs2(data_fungi_mini, ref_fasta = ref_fasta)
head(res)
# Add taxonomy to phyloseq:
physeq_new <- assign_mmseqs2(
data_fungi_mini,
ref_fasta = ref_fasta,
behavior = "add_to_phyloseq"
)
} # }