
<a href="https://adrientaudiere.github.io/MiscMetabar/articles/Rules.html#lifecycle"> <img src="https://img.shields.io/badge/lifecycle-maturing-blue" alt="lifecycle-maturing"></a>

A wrapper of [taxize::gna_verifier()] applied to a phyloseq object

Usage

gna_verifier_pq(
  physeq = NULL,
  taxnames = NULL,
  taxonomic_rank = c("Genus", "Species"),
  data_sources = c(1, 12),
  all_matches = FALSE,
  capitalize = FALSE,
  species_group = FALSE,
  fuzzy_uninomial = FALSE,
  verbose = TRUE,
  add_to_phyloseq = NULL,
  col_prefix = NULL,
  genus_species_canonical_col = TRUE,
  year_col = TRUE,
  authorship_col = TRUE,
  discard_NA = TRUE,
  problematic_chars = "[?\\\\#|&]",
  clean_problematic_chars = FALSE,
  force_recompute = FALSE,
  species_only = TRUE
)

Arguments

physeq

(optional) A phyloseq object. Either `physeq` or `taxnames` must be provided, but not both.

taxnames

(optional) A character vector of taxonomic names.

taxonomic_rank

(Character) The column(s) present in the @tax_table slot of the phyloseq object. Can be a vector of two columns (e.g. the default c("Genus", "Species")).

data_sources

A character or integer vector. See [taxize::gna_verifier()] documentation. For example, 1=Catalogue of Life, 3=ITIS, 5=Index Fungarum, 11=GBIF backbone and 210=TaxRef.

all_matches

(Logical) See [taxize::gna_verifier()] documentation.

capitalize

(Logical) See [taxize::gna_verifier()] documentation.

species_group

(Logical) See [taxize::gna_verifier()] documentation.

fuzzy_uninomial

(Logical) See [taxize::gna_verifier()] documentation.

verbose

(logical, default TRUE) If TRUE, print informative messages.

add_to_phyloseq

(logical, default NULL, which resolves to TRUE when `physeq` is provided and to FALSE when `taxnames` is provided)

- If FALSE, return a cleaned tibble derived from [taxize::gna_verifier()] output, with columns `submittedName`, `currentName`, `currentCanonicalSimple` (and `genusEpithet`/`specificEpithet` when `genus_species_canonical_col = TRUE`, `namePublishedInYear` when `year_col = TRUE`, authorship columns when `authorship_col = TRUE`), plus a column `taxa_names_in_phyloseq` with the original taxon name from the phyloseq object (or `NULL` when `taxnames` is provided directly).

- If TRUE, return a phyloseq object with an amended `@tax_table` slot. Cannot be TRUE if `taxnames` is provided. At least three new columns are added:

  - **taxa_name**: The character string sent to gna_verifier (e.g. `Antrodiella brasiliensis`).
  - **currentName**: The current accepted name (synonyms resolved) with authorities at the end of the binomial name (e.g. `Trametopsis brasiliensis (Ryvarden & de Meijer) Gomez-Mont. & Robledo`).
  - **currentCanonicalSimple**: The current accepted name without authorities (e.g. `Trametopsis brasiliensis`, `Russula`).

Other columns can be added depending on the parameters: `genus_species_canonical_col` (adds "genusEpithet", "specificEpithet", and "genusSpeciesEpithet"), `year_col`, `authorship_col`.

col_prefix

A character string to be added as a prefix to the names of the new columns added to the tax_table slot of the phyloseq object (default: NULL).

genus_species_canonical_col

(logical, default TRUE) If TRUE three new columns are added along with "currentCanonicalSimple": "genusEpithet", "specificEpithet" and "genusSpeciesEpithet". "genusSpeciesEpithet" is identical to "currentCanonicalSimple" but is NA when "specificEpithet" is NA or empty (i.e. genus-only names are excluded).
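The rule for "genusSpeciesEpithet" can be illustrated with a minimal base R sketch. The toy data frame below is a hypothetical stand-in for the verified-name table, and the code is not the package's actual implementation; it only mirrors the rule stated above.

```r
# Toy stand-in for the verified-name table (illustrative values only).
res <- data.frame(
  currentCanonicalSimple = c("Trametopsis brasiliensis", "Russula", NA),
  genusEpithet = c("Trametopsis", "Russula", NA),
  specificEpithet = c("brasiliensis", "", NA),
  stringsAsFactors = FALSE
)

# A specific epithet counts as present when it is neither NA nor empty.
has_species <- !is.na(res$specificEpithet) & nzchar(res$specificEpithet)

# Copy the canonical name only for names with a specific epithet;
# genus-only and unmatched names get NA.
res$genusSpeciesEpithet <- ifelse(
  has_species,
  res$currentCanonicalSimple,
  NA_character_
)

res[, c("currentCanonicalSimple", "genusSpeciesEpithet")]
```

In this sketch only the first row keeps a value in "genusSpeciesEpithet"; the genus-only name and the unmatched name are both set to NA.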

year_col

(logical, default TRUE) If TRUE a new column "namePublishedInYear" is added with the year of publication.

authorship_col

(logical, default TRUE) If TRUE three new columns are added: "authorship", "bracketauthorship" and "scientificNameAuthorship".

discard_NA

(logical, default `TRUE`). Passed to [taxonomic_rank_to_taxnames()].

problematic_chars

A regex pattern (character string) to detect characters that are problematic for the GNA Verifier API URL. The API pastes names pipe-separated into a GET URL path, so characters like `?` (query-string delimiter), `\` (escape), `|` (pipe separator), `#` (fragment), or `&` (parameter separator) corrupt the URL and can cause a length-mismatch crash in [taxize::gna_verifier()]. Names containing these characters are reported and, if `clean_problematic_chars` is `TRUE`, handled before verification. Set to `NULL` to disable detection. Default: `"[?\\\\#|&]"` (written as an R string literal, as in Usage).
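To see which names such a pattern flags, the detection step can be sketched in base R. This is an illustration only, using the default pattern as shown in Usage; the name vector and the small `tax` matrix are made up, and the package's internal code may differ. The last two steps mimic what `clean_problematic_chars = TRUE` is described as doing for `taxnames` and for `physeq` input.

```r
# Default pattern as an R string literal (taken from the Usage section).
problematic_chars <- "[?\\\\#|&]"

# Hypothetical names; the last one contains a literal backslash.
taxnames <- c(
  "Russula emetica", "Amanita sp?", "Boletus|edulis",
  "Inocybe #1", "Xerocomus\\badius"
)

# Flag names containing at least one problematic character.
is_problematic <- grepl(problematic_chars, taxnames)
taxnames[is_problematic]

# taxnames input: matching names are filtered out before verification.
clean_names <- taxnames[!is_problematic]

# physeq input: matching cells of the rank columns are replaced with NA.
# `tax` is a toy character matrix standing in for a tax_table.
tax <- matrix(
  c("Russula", "Amanita", "emetica", "sp?"),
  ncol = 2,
  dimnames = list(c("ASV1", "ASV2"), c("Genus", "Species"))
)
tax[grepl(problematic_chars, tax)] <- NA
```

Running the same `grepl()` call on your own names before calling `gna_verifier_pq()` is a cheap way to decide whether to set `clean_problematic_chars = TRUE` or to clean the data upstream.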

clean_problematic_chars

(logical, default `FALSE`) If `TRUE`, cells in the `taxonomic_rank` columns that match `problematic_chars` are replaced with `NA` (when `physeq` is provided) and matching names are filtered out (when `taxnames` is provided) before verification. If `FALSE` (the default), a warning is issued listing the problematic names but they are sent as-is – this will likely cause an error in [taxize::gna_verifier()]. Set to `TRUE` to handle them automatically, or clean the data upstream (e.g. with [MiscMetabar::simplify_taxo()]).

force_recompute

(logical, default `FALSE`) If `TRUE`, remove any existing columns in the `tax_table` that would be re-added by this call (i.e. columns matching `col_prefix` when `col_prefix` is set, or columns in `new_cols` when `col_prefix` is `NULL`) before performing the verification. This is useful when re-running `gna_verifier_pq()` on a phyloseq that already contains result columns from a previous call. If `FALSE`, existing columns are left in place, which can cause duplicate-column errors in `tax_table()` on re-runs.
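The column clean-up described for `force_recompute = TRUE` can be sketched as follows. This is a hedged illustration on a toy character matrix: the prefix `"gna_"`, the contents of `new_cols`, and the use of `startsWith()` as the matching rule are all assumptions made for the example, not the package's confirmed behaviour.

```r
# Toy character matrix standing in for a tax_table that already holds
# result columns from a previous call.
tax <- matrix(
  c("Russula", "Amanita",
    "emetica", "muscaria",
    "Russula emetica", "Amanita muscaria",
    "Russula emetica", "Amanita muscaria"),
  ncol = 4,
  dimnames = list(
    c("ASV1", "ASV2"),
    c("Genus", "Species", "gna_currentName", "currentName")
  )
)

# Case 1: col_prefix is set (hypothetical value), so drop every column
# whose name starts with that prefix.
col_prefix <- "gna_"
stale_prefixed <- startsWith(colnames(tax), col_prefix)
tax_no_prefixed <- tax[, !stale_prefixed, drop = FALSE]

# Case 2: col_prefix is NULL, so drop columns named like the new columns.
# `new_cols` here is an assumed, shortened list.
new_cols <- c("taxa_name", "currentName", "currentCanonicalSimple")
stale_named <- colnames(tax) %in% new_cols
tax_no_named <- tax[, !stale_named, drop = FALSE]
```

Using `drop = FALSE` keeps the result a matrix even when a single column remains, which matters when the table is handed back to `tax_table()`.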

species_only

(logical, default TRUE) If TRUE, `currentCanonicalSimple` is set to `NA` for uninomial names (i.e. when `matchedCardinality == 1`, meaning only a genus or higher-rank name was matched, not a proper species binomial). The `genusEpithet` column is always populated from the matched name regardless of this setting. The `specificEpithet` column is always `NA` for uninomials, independently of this parameter.
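The effect of `species_only` can be pictured with a small base R sketch. The data frame is a made-up stand-in for the verifier output, and the code only reproduces the rule described above under that assumption; it is not taken from the package source.

```r
# Toy verifier output: one binomial match and one genus-only match.
res <- data.frame(
  matchedCardinality = c(2L, 1L),
  currentCanonicalSimple = c("Trametopsis brasiliensis", "Russula"),
  genusEpithet = c("Trametopsis", "Russula"),
  specificEpithet = c("brasiliensis", NA),
  stringsAsFactors = FALSE
)

species_only <- TRUE

# Uninomial matches lose their canonical name when species_only is TRUE;
# genusEpithet is left untouched.
if (species_only) {
  uninomial <- res$matchedCardinality == 1
  res$currentCanonicalSimple[uninomial] <- NA
}

res
```

With `species_only <- FALSE` in this sketch, the second row would keep `Russula` in `currentCanonicalSimple`, while `specificEpithet` would still be NA.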

Value

Either a tibble (if add_to_phyloseq = FALSE) or a new phyloseq object with new columns (see param add_to_phyloseq) in the tax_table slot.

Details

This function is mainly a wrapper around the work of others. Please cite the `taxize` package.

See also

[taxize::gna_verifier()]

Author

Adrien Taudiere

Examples

if (FALSE) { # \dontrun{
df <- gna_verifier_pq(data_fungi, data_sources = 210, add_to_phyloseq = FALSE)

data_fungi_mini_cleanNames <- gna_verifier_pq(data_fungi_mini, data_sources = 210)

data_fungi_cleanNames <- gna_verifier_pq(data_fungi, data_sources = 210)

sum(!is.na(data_fungi_cleanNames@tax_table[, "currentName"]))
sum(data_fungi_cleanNames@tax_table[, "currentCanonicalSimple"] !=
  data_fungi_cleanNames@tax_table[, "taxa_name"], na.rm = TRUE)
# 1010 taxa (71% of total) are identified using a currentName including 434
# corrected values (correction using synonym disambiguation)


tr <- rotl_pq(data_fungi_cleanNames,
  taxonomic_rank = "currentCanonicalSimple",
  context_name = "Basidiomycetes"
)

p <- ggtree::ggtree(tr, layout = "roundrect") +
  ggtree::geom_nodelab(hjust = 1, vjust = -1.2, size = 2) +
  ggtree::geom_tiplab(size = 2)

p + xlim(0, max(p$data$x) + 1)


psmelt(data_fungi_mini_cleanNames) |>
  filter(Abundance > 0) |>
  mutate(namePublishedInYear = as.numeric(namePublishedInYear)) |>
  pull(namePublishedInYear) |>
  hist(breaks = 100)


# Do fungal species discovered more recently tend to be found at
# greater heights in the tree?
psmelt(data_fungi_mini_cleanNames) |>
  filter(Abundance > 0) |>
  group_by(Height) |>
  mutate(namePublishedInYear = as.numeric(namePublishedInYear)) |>
  ggstatsplot::ggbetweenstats("Height", "namePublishedInYear")
} # }