Verify (and fix) scientific names (Genus species) of a phyloseq object.
Source: R/gna_verifier_pq.R
Lifecycle: maturing
A wrapper around [taxize::gna_verifier()] applied to a phyloseq object.
Usage
gna_verifier_pq(
physeq = NULL,
taxnames = NULL,
taxonomic_rank = c("Genus", "Species"),
data_sources = c(1, 12),
all_matches = FALSE,
capitalize = FALSE,
species_group = FALSE,
fuzzy_uninomial = FALSE,
verbose = TRUE,
add_to_phyloseq = NULL,
col_prefix = NULL,
genus_species_canonical_col = TRUE,
year_col = TRUE,
authorship_col = TRUE,
discard_NA = TRUE,
problematic_chars = "[?\\\\#|&]",
clean_problematic_chars = FALSE,
force_recompute = FALSE,
species_only = TRUE
)
Arguments
- physeq
(optional) A phyloseq object. Either `physeq` or `taxnames` must be provided, but not both.
- taxnames
(optional) A character vector of taxonomic names.
- taxonomic_rank
(Character) The column(s) present in the @tax_table slot of the phyloseq object. Can be a vector of two columns (e.g. the default c("Genus", "Species")).
- data_sources
A character or integer vector. See [taxize::gna_verifier()] documentation. For example, 1=Catalogue of Life, 3=ITIS, 5=Index Fungarum, 11=GBIF backbone and 210=TaxRef.
- all_matches
(Logical) See [taxize::gna_verifier()] documentation.
- capitalize
(Logical) See [taxize::gna_verifier()] documentation.
- species_group
(Logical) See [taxize::gna_verifier()] documentation.
- fuzzy_uninomial
(Logical) See [taxize::gna_verifier()] documentation.
- verbose
(logical, default TRUE) If TRUE, print informative messages.
- add_to_phyloseq
(logical, default TRUE when physeq is provided, FALSE when taxnames is provided)
- If FALSE, return a cleaned tibble derived from [taxize::gna_verifier()] output, with columns `submittedName`, `currentName`, `currentCanonicalSimple` (and `genusEpithet`/`specificEpithet` when `genus_species_canonical_col = TRUE`, `namePublishedInYear` when `year_col = TRUE`, authorship columns when `authorship_col = TRUE`), plus a column `taxa_names_in_phyloseq` with the original taxon name from the phyloseq object (or `NULL` when `taxnames` is provided directly).
- If TRUE, return a phyloseq object with an amended `@tax_table` slot. Cannot be TRUE if `taxnames` is provided. At least three new columns are added:
  - **taxa_name**: The character string sent to gna_verifier (e.g. `Antrodiella brasiliensis`).
  - **currentName**: The current accepted name (synonyms resolved), with authorities at the end of the binomial name (e.g. `Trametopsis brasiliensis (Ryvarden & de Meijer) Gomez-Mont. & Robledo`).
  - **currentCanonicalSimple**: The current accepted name without authorities (e.g. `Trametopsis brasiliensis`, `Russula`).
Other columns can be added depending on the parameters: `genus_species_canonical_col` (adds "genusEpithet", "specificEpithet", and "genusSpeciesEpithet"), `year_col` and `authorship_col`.
- col_prefix
A character string added as a prefix to the names of the new columns in the tax_table slot of the phyloseq object (default: NULL).
- genus_species_canonical_col
(logical, default TRUE) If TRUE three new columns are added along with "currentCanonicalSimple": "genusEpithet", "specificEpithet" and "genusSpeciesEpithet". "genusSpeciesEpithet" is identical to "currentCanonicalSimple" but is NA when "specificEpithet" is NA or empty (i.e. genus-only names are excluded).
- year_col
(logical, default TRUE) If TRUE a new column "namePublishedInYear" is added with the year of publication.
- authorship_col
(logical, default TRUE) If TRUE three new columns are added: "authorship", "bracketauthorship" and "scientificNameAuthorship".
- discard_NA
(logical, default `TRUE`). Passed to [taxonomic_rank_to_taxnames()].
- problematic_chars
A regex pattern (character string) to detect characters that are problematic for the GNA Verifier API URL. The API pastes names pipe-separated into a GET URL path, so characters like `?` (query-string delimiter), `\` (escape), `|` (pipe separator), `#` (fragment), or `&` (parameter separator) corrupt the URL and can cause a length-mismatch crash in [taxize::gna_verifier()]. Names containing these characters are reported and, if `clean_problematic_chars` is `TRUE`, handled before verification. Set to `NULL` to disable detection. Default: `"[?\\\\#|&]"`.
- clean_problematic_chars
(logical, default `FALSE`) If `TRUE`, cells in the `taxonomic_rank` columns that match `problematic_chars` are replaced with `NA` (when `physeq` is provided) and matching names are filtered out (when `taxnames` is provided) before verification. If `FALSE` (the default), a warning is issued listing the problematic names but they are sent as-is – this will likely cause an error in [taxize::gna_verifier()]. Set to `TRUE` to handle them automatically, or clean the data upstream (e.g. with [MiscMetabar::simplify_taxo()]).
- force_recompute
(logical, default `FALSE`) If `TRUE`, remove any existing columns in the `tax_table` that would be re-added by this call (i.e. columns matching `col_prefix` when `col_prefix` is set, or columns in `new_cols` when `col_prefix` is `NULL`) before performing the verification. This is useful when re-running `gna_verifier_pq()` on a phyloseq that already contains result columns from a previous call. If `FALSE`, existing columns are left in place, which can cause duplicate-column errors in `tax_table()` on re-runs.
- species_only
(logical, default TRUE) If TRUE, `currentCanonicalSimple` is set to `NA` for uninomial names (i.e. when `matchedCardinality == 1`, meaning only a genus or higher-rank name was matched, not a proper species binomial). The `genusEpithet` column is always populated from the matched name regardless of this setting. The `specificEpithet` column is always `NA` for uninomials, independently of this parameter.
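The detection and filtering described for `problematic_chars` and `clean_problematic_chars` can be tried offline. The sketch below is a minimal base-R illustration, assuming the pattern is applied with `grepl()`. The helper `flag_problematic()` and the example names are hypothetical and are not part of MiscMetabar.

```r
# Default pattern from the Usage section: matches ?, \, #, | and &
problematic_chars <- "[?\\\\#|&]"

# Hypothetical input names (not taken from a real dataset)
taxnames <- c(
  "Amanita muscaria",
  "Russula sp?",
  "Boletus edulis|Boletus sp",
  "Tuber #1",
  "Xylaria & co"
)

# Hypothetical helper: TRUE for names that contain a problematic character
flag_problematic <- function(x, pattern) {
  if (is.null(pattern)) {
    return(rep(FALSE, length(x))) # NULL disables detection
  }
  grepl(pattern, x)
}

is_bad <- flag_problematic(taxnames, problematic_chars)

# With clean_problematic_chars = TRUE and a `taxnames` input, matching names
# are described as being filtered out before verification
clean_taxnames <- taxnames[!is_bad]
print(clean_taxnames)
```

Inspecting `taxnames[is_bad]` before calling `gna_verifier_pq()` is one way to decide whether to set `clean_problematic_chars = TRUE` or to fix the names upstream.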
Value
Either a tibble (if add_to_phyloseq = FALSE) or a new phyloseq object with new columns (see param add_to_phyloseq) in the tax_table slot.
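The column rules described for `species_only` and `genus_species_canonical_col` can be sketched on a toy table. This is a minimal base-R illustration under stated assumptions, not the package's implementation: the data frame `res` and its values are hypothetical, and only the column names come from the argument descriptions above.

```r
# Hypothetical verifier output: one binomial match and one genus-only match
res <- data.frame(
  currentCanonicalSimple = c("Trametopsis brasiliensis", "Russula"),
  matchedCardinality = c(2L, 1L),
  genusEpithet = c("Trametopsis", "Russula"),
  specificEpithet = c("brasiliensis", NA),
  stringsAsFactors = FALSE
)

species_only <- TRUE

# genusSpeciesEpithet: copy of currentCanonicalSimple, NA for genus-only names
no_epithet <- is.na(res$specificEpithet) | res$specificEpithet == ""
res$genusSpeciesEpithet <- ifelse(
  no_epithet, NA_character_, res$currentCanonicalSimple
)

# species_only: set currentCanonicalSimple to NA for uninomial matches;
# genusEpithet is left untouched
if (species_only) {
  res$currentCanonicalSimple[res$matchedCardinality == 1] <- NA_character_
}

print(res)
```

In this sketch the genus-only row keeps `Russula` in `genusEpithet`, while `currentCanonicalSimple`, `specificEpithet` and `genusSpeciesEpithet` are all `NA` for that row.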
Examples
if (FALSE) { # \dontrun{
# Return a tibble instead of a phyloseq object
df <- gna_verifier_pq(data_fungi, data_sources = 210, add_to_phyloseq = FALSE)

# Add the verified names to the tax_table slot
data_fungi_mini_cleanNames <- gna_verifier_pq(data_fungi_mini, data_sources = 210)
data_fungi_cleanNames <- gna_verifier_pq(data_fungi, data_sources = 210)

# Number of taxa with a currentName
sum(!is.na(data_fungi_cleanNames@tax_table[, "currentName"]))
# Number of taxa whose canonical name differs from the submitted name
sum(
  data_fungi_cleanNames@tax_table[, "currentCanonicalSimple"] !=
    data_fungi_cleanNames@tax_table[, "taxa_name"],
  na.rm = TRUE
)
# 1010 taxa (71% of total) are identified using a currentName, including 434
# corrected values (correction using synonym disambiguation)

tr <- rotl_pq(data_fungi_cleanNames,
  taxonomic_rank = "currentCanonicalSimple",
  context_name = "Basidiomycetes"
)
p <- ggtree::ggtree(tr, layout = "roundrect") +
  ggtree::geom_nodelab(hjust = 1, vjust = -1.2, size = 2) +
  ggtree::geom_tiplab(size = 2)
p + xlim(0, max(p$data$x) + 1)

# Distribution of publication years among the taxa that are present
psmelt(data_fungi_mini_cleanNames) |>
  filter(Abundance > 0) |>
  mutate(namePublishedInYear = as.numeric(namePublishedInYear)) |>
  pull(namePublishedInYear) |>
  hist(breaks = 100)

# Do fungal species discovered more recently tend to be found at
# greater heights in the tree?
psmelt(data_fungi_mini_cleanNames) |>
  filter(Abundance > 0) |>
  group_by(Height) |>
  mutate(namePublishedInYear = as.numeric(namePublishedInYear)) |>
  ggstatsplot::ggbetweenstats("Height", "namePublishedInYear")
} # }
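The two `sum()` calls in the examples above need a verified phyloseq object and therefore network access. The same counting logic can be tried offline on a small character matrix. This is a sketch only: the matrix `tax` and its values are hypothetical, and only the column names `taxa_name` and `currentCanonicalSimple` come from this page.

```r
# Hypothetical tax_table-like character matrix (values are illustrative)
tax <- matrix(
  c(
    "Antrodiella brasiliensis", "Trametopsis brasiliensis", # synonym resolved
    "Amanita muscaria", "Amanita muscaria", # name unchanged
    "Unknown fungus", NA # no match returned
  ),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("taxon_1", "taxon_2", "taxon_3"),
    c("taxa_name", "currentCanonicalSimple")
  )
)

# Taxa for which a verified name is available
n_matched <- sum(!is.na(tax[, "currentCanonicalSimple"]))

# Taxa whose verified name differs from the submitted one;
# na.rm = TRUE drops comparisons involving unmatched taxa
n_corrected <- sum(
  tax[, "currentCanonicalSimple"] != tax[, "taxa_name"],
  na.rm = TRUE
)

print(c(matched = n_matched, corrected = n_corrected))
```

Without `na.rm = TRUE`, a single unmatched taxon would make the second sum return `NA`.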