Skip to contents

<a href="https://adrientaudiere.github.io/MiscMetabar/articles/Rules.html#lifecycle"> <img src="https://img.shields.io/badge/lifecycle-experimental-orange" alt="lifecycle-experimental"></a>

The idea is to take the binomial taxonomic name assigned to each ASV/OTU at the Genus_species level, search for sequences in NCBI nucleotide database corresponding to this taxon name (with some additional filters including the marker name), retrieve the sequences in fasta format, and then perform a BLAST search of retrieved sequences against the ASV/OTU sequences.

We can therefore test for each ASV/OTU if the best BLAST hit corresponds to the same taxon name as the one assigned to the ASV/OTU. Moreover, we can also detect some cases where a better taxonomic assignment can be proposed based on the BLAST results limited to species name already present in the phyloseq object.

Note that this function need a physeq object and cannot works with a list of taxonomic names (taxnames is not a parameter of the function).

Usage

tax_retroblast_pq(
  physeq,
  taxonomic_rank = "currentCanonicalSimple",
  marker = NULL,
  id_cut = 99,
  retmax = 500,
  add_to_phyloseq = TRUE,
  verbose = TRUE,
  start_date = NULL,
  end_date = NULL,
  min_length = 300,
  max_length = 4000,
  refseq_only = FALSE,
  sup_params = "NOT uncultured[Title] NOT clone[Title]",
  discard_genus_alone = identical(taxonomic_rank, "currentCanonicalSimple"),
  discard_NA = TRUE,
  ...
)

Arguments

physeq

(required) A phyloseq object

taxonomic_rank

(required, default = "currentCanonicalSimple") The column(s) present in the @tax_table slot of the phyloseq object. Can be a vector of two columns (e.g. c("Genus", "Species")).

marker

(required) A character vector of marker names to be used in the search term. For example, c("ITS", "internal transcribed spacer") for fungal ITS sequences. Note that the marker names should be present in the title of the sequences in NCBI nucleotide database.

id_cut

(default: 99) minimum as a good match. A 100 value means that only perfect matches are considered as good matches.

retmax

(default: 500) maximum number of sequences to retrieve from NCBI nucleotide database for each taxon name.

add_to_phyloseq

(logical, default TRUE) If TRUE, a new phyloseq object is returned with new columns in the tax_table.

verbose

(logical, default TRUE) If TRUE, prompt some messages.

start_date

The start date for the search. If NULL (default), the search is not limited by date. The date must be in the format "YYYY-MM-DD".

end_date

() The end date for the search. If NULL (default), the search is not limited by date. If start_date is not NULL and end_date is NULL, the end_date is set to today's date. The date must be in the format "YYYY-MM-DD".

min_length

(int) Minimum sequence length to consider in the search.

max_length

(int) Maximum sequence length to consider in the search.

refseq_only

(logical, default FALSE) If TRUE, only sequences from the RefSeq database are retrieved. RefSeq is a curated non-redundant database of sequences from NCBI. If FALSE, all sequences from NCBI nucleotide database are retrieved. Note that using refseq_only = TRUE is experimental and may lead to no sequence retrieved for some taxon names.

sup_params

(char) Additional parameters to be added to the search term. By default set to ("NOT uncultured[Title] NOT clone[Title]") to exclude uncultured and clone sequences.

discard_genus_alone

(logical, default `TRUE` when `taxonomic_rank == "currentCanonicalSimple"`). Passed to [taxonomic_rank_to_taxnames()].

discard_NA

(logical, default `TRUE`). Passed to [taxonomic_rank_to_taxnames()].

...

Additional parameters to be passed to [MiscMetabar::blast_to_phyloseq()] including: `nproc`, `e_value_cut` and `args_blastn`

Value

Either a list (if add_to_phyloseq = FALSE) or a new phyloseq object, if add_to_phyloseq = TRUE, with new columns based on the `tib_retroblast` tibble describe below:

The list is composed of two elements: 1. `tib_retroblast`: A tibble with one row for each taxa of the phyloseq object: - `blast_queried`: (logical) queried names for sequences - `blast_result`: (logical) Number of queried names with at least one blast result - `good_assign`: (logical) Number of good assignation (best blast hit with as the one assigned to the ASV/OTU) - `alt_assign`: Number of alternative assignation proposed (best blast hit with the phyloseq object) - `taxa_name`: Taxonomic name used to query NCBI nucleotide database

2. `entrez_search`: A list of the rentrez::entrez_search results for each taxon name

See also

[MiscMetabar::blast_to_phyloseq()], [rentrez::entrez_search()]

Author

Adrien Taudiere

Examples

if (FALSE) { # \dontrun{
data_fungi_mini_cleanNames <-
  gna_verifier_pq(data_fungi_mini,
    data_source = 210)

res_retro <- tax_retroblast_pq(data_fungi_mini_cleanNames,
  marker = c("ITS", "internal transcribed spacer"),
  retmax = 10,
   id_cut = 99,
   add_to_phyloseq = FALSE
)

res_retro$tib_retroblast |>
  summarise(
    prop_good_assign = sum(good_assign) / sum(blast_result),
    n_alt_assign = sum(!is.na(alt_assign))
  )

table(res_retro$tib_retroblast$alt_assign)

res_retro_100 <- tax_retroblast_pq(data_fungi_mini_cleanNames,
  marker = c("ITS", "internal transcribed spacer"),
  retmax = 100, id_cut = 100
)

# nb of queried names for sequences (id=100%)
res_retro_100$tib_retroblast$blast_queried |> sum()
# nb of queried names with at least one blast result (id=100%)
res_retro_100$tib_retroblast$blast_result |> sum()
# nb of good assignation (id=100%)
res_retro_100$tib_retroblast$good_assign |> sum()
# nb of alternative assignation proposed (id=100%)
res_retro_100$tib_retroblast$alt_assign |>
  is.na() |>
  sapply(isFALSE) |>
  sum()

} # }