Provides an overview of a FASTA reference database: number of sequences, sequence length distribution, and taxonomic coverage at each rank. Supports both prefix-based formats (unite, sintax, greengenes2) and positional formats (dada2, pr2).
Usage
summarize_db(
file,
rank_prefixes = c("k__", "p__", "c__", "o__", "f__", "g__", "s__"),
tax_format = NULL
)
Arguments
- file
(Character, required) Path to a FASTA file (plain or gzip).
- rank_prefixes
(Character vector) Taxonomic rank prefixes to summarize. Defaults to kingdom through species. Ignored if tax_format is provided.
- tax_format
(Character) If provided, one of "unite", "sintax", "greengenes2", or "pr2". Overrides rank_prefixes with the full set from tax_prefixes(). If "auto", the format is detected from the file headers using detect_tax_format(). If NULL (default), rank_prefixes is used as-is.
Value
A list with components:
- n_sequences: total number of sequences
- length_summary: summary statistics of sequence lengths
- ranks: a named integer vector of annotated counts per rank
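To illustrate what an annotated count per rank means for a prefix-based format, here is a minimal base R sketch. It is not the package's implementation: the headers are made up, count_annotated() is a hypothetical helper, and it assumes that an empty annotation appears as a bare prefix such as "o__" with nothing after it.

```r
# Made-up headers in a UNITE-like, prefix-based style (illustration only).
headers <- c(
  "seq1|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__;f__;g__;s__",
  "seq2|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__;g__;s__",
  "seq3|k__Fungi;p__;c__;o__;f__;g__;s__"
)
rank_prefixes <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")

# Hypothetical helper: for each prefix, count the headers in which the prefix
# starts a field and is followed by at least one non-separator character.
count_annotated <- function(headers, prefixes) {
  counts <- vapply(prefixes, function(p) {
    pattern <- paste0("(^|[;,|])", p, "[^;,|]+")
    sum(grepl(pattern, headers))
  }, integer(1))
  # Name the result by rank letter, e.g. "k__" becomes "k".
  names(counts) <- sub("_+$", "", prefixes)
  counts
}

ranks <- count_annotated(headers, rank_prefixes)
print(ranks)
```

In this toy input all three headers carry a kingdom, two carry a phylum and class, one carries an order, and none is annotated below that.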
Examples
db <- system.file("extdata", "example_unite.fasta", package = "dbpq")
summarize_db(db)
#> Database: example_unite.fasta
#> Sequences: 5
#> Sequence length: 68-68 (mean: 68)
#> k: 5 sequences with annotation
#> p: 5 sequences with annotation
#> c: 5 sequences with annotation
#> o: 5 sequences with annotation
#> f: 5 sequences with annotation
#> g: 5 sequences with annotation
#> s: 5 sequences with annotation
summarize_db(db, tax_format = "unite")
#> Database: example_unite.fasta
#> Sequences: 5
#> Sequence length: 68-68 (mean: 68)
#> k: 5 sequences with annotation
#> p: 5 sequences with annotation
#> c: 5 sequences with annotation
#> o: 5 sequences with annotation
#> f: 5 sequences with annotation
#> g: 5 sequences with annotation
#> s: 5 sequences with annotation
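The returned list can also be used programmatically. The sketch below works on a hand-built stand-in with the components described under Value, filled with the numbers printed above. The element names of ranks (k, p, ...) and the use of summary() for length_summary are assumptions, so the real object may differ.

```r
# Hand-built stand-in for a summarize_db() result (assumed shape, not real output).
res <- list(
  n_sequences = 5L,
  length_summary = summary(rep(68L, 5)),
  ranks = c(k = 5L, p = 5L, c = 5L, o = 5L, f = 5L, g = 5L, s = 5L)
)

# Fraction of sequences that carry an annotation at each rank.
coverage <- res$ranks / res$n_sequences

# Ranks at which at least one sequence lacks an annotation.
incomplete <- names(coverage)[coverage < 1]

print(coverage)
print(incomplete)
```

With a real database, incomplete would list the ranks where coverage drops below 100%, which is often the first thing to check before using a reference for classification.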