Scans a taxonomy table for common problematic values such as NA-like
strings, placeholder labels ("unclassified", "unknown", etc.),
and empty QIIME-style rank prefixes. The input can be a
phyloseq object or a FASTA reference database
file.
Returns a tibble summarising, for each pattern found, how many matches occur in each taxonomic rank.
Usage
count_unwanted_tax(
x,
patterns = unwanted_tax_patterns_default(),
tax_format = "auto"
)
Arguments
- x
Either:
a character string giving the path to a FASTA file (plain or gzip-compressed), or
a phyloseq object with a taxonomy table.
- patterns
(Character vector) Regular expressions to search for. When MiscMetabar is installed, defaults to MiscMetabar::unwanted_tax_patterns; otherwise falls back to a built-in copy of the same patterns. See Details.
- tax_format
(Character) Taxonomy format of the FASTA file. One of
"unite", "sintax", "greengenes2", "pr2", or "auto". Only used when x is a file path. If "auto" (default), the format is detected with detect_tax_format(). Ignored when x is a phyloseq object.
Value
A tibble with columns:
- pattern
The regular expression that matched.
- description
A human-readable label for the pattern.
- rank
The taxonomic rank (column name) where matches were found.
- n_matches
Number of values matching the pattern in that rank.
- example_values
Up to 3 unique matching values (comma-separated).
Rows with zero matches are omitted.
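The shape of the returned summary can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name summarise_matches(), the toy taxonomy table, the description labels, the use of perl = TRUE, and the choice to skip missing values are all assumptions made for illustration.

```r
# Toy taxonomy table (ranks as columns); values are made up for illustration.
tax <- data.frame(
  Genus   = c("Amanita", "unclassified", "g__", NA),
  Species = c("Amanita_muscaria", "Fungi_unclassified", "s__", "unknown"),
  stringsAsFactors = FALSE
)

# Two of the documented default patterns, with hypothetical descriptions.
patterns <- c(
  "[Uu]nclassified" = "placeholder: unclassified",
  "^[kpcofgs]__$"   = "empty QIIME-style rank prefix"
)

# Hypothetical helper: one output row per (pattern, rank) pair with matches.
summarise_matches <- function(tax, patterns) {
  rows <- list()
  for (pat in names(patterns)) {
    for (rank in colnames(tax)) {
      values <- tax[[rank]]
      values <- values[!is.na(values)]  # assumption: missing values are skipped
      hits <- values[grepl(pat, values, perl = TRUE)]
      if (length(hits) == 0) next       # rows with zero matches are omitted
      rows[[length(rows) + 1]] <- data.frame(
        pattern = pat,
        description = patterns[[pat]],
        rank = rank,
        n_matches = length(hits),
        example_values = paste(head(unique(hits), 3), collapse = ", "),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

res <- summarise_matches(tax, patterns)
print(res)
```

Each pattern in this toy table matches once in Genus and once in Species, so the result has four rows, while the "unknown" value goes unreported because no pattern covering it was supplied.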
Details
The default patterns are:
- "^[Nn][Aa][Nn]?$": NaN, nan, NA, na
- "^[Nn]/[Aa]$": N/A, n/a
- "^[Nn]one$": None, none
- "^$": empty string
- "^\\s+$": whitespace only
- "[Uu]nclassified": unclassified, Unclassified, xxx_unclassified
- "[Uu]nknown": unknown, Unknown, xxx_unknown
- "[Uu]nidentified": unidentified, Unidentified
- "[Uu]ncultured": uncultured, Uncultured
- "[Ii]ncertae[_\\s]?[Ss]edis": incertae_sedis, Incertae sedis
- "^[Mm]etagenome$": metagenome, Metagenome
- "^[Ee]nvironmental": environmental, Environmental
- "^[kpcofgs]__$": empty QIIME-style rank prefixes
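To see how the anchored and unanchored patterns behave, the following base R sketch checks a few made-up values against the patterns listed above. The sample values are hypothetical, and matching with grepl(perl = TRUE) is an assumption; the package may use a different regex engine internally.

```r
# The default patterns as listed in Details, written as R string literals.
default_patterns <- c(
  "^[Nn][Aa][Nn]?$",
  "^[Nn]/[Aa]$",
  "^[Nn]one$",
  "^$",
  "^\\s+$",
  "[Uu]nclassified",
  "[Uu]nknown",
  "[Uu]nidentified",
  "[Uu]ncultured",
  "[Ii]ncertae[_\\s]?[Ss]edis",
  "^[Mm]etagenome$",
  "^[Ee]nvironmental",
  "^[kpcofgs]__$"
)

# Hypothetical taxonomy values; the last two are legitimate names.
values <- c("NaN", "n/a", "", "  ", "Fungi_unclassified",
            "Incertae sedis", "g__", "g__Amanita", "Nanoarchaeota")

# A value is flagged if at least one pattern matches it.
flagged <- vapply(
  values,
  function(v) {
    any(vapply(default_patterns, grepl, logical(1), x = v, perl = TRUE))
  },
  logical(1)
)

print(unname(flagged))
```

Because the NA-like and rank-prefix patterns are anchored with ^ and $, values such as "Nanoarchaeota" and "g__Amanita" are not flagged, whereas the unanchored "[Uu]nclassified" pattern also catches suffixed forms like "Fungi_unclassified".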