Convert a FASTA database to a specified taxonomy format

Rewrites sequence headers to the requested output format, either auto-detecting the input taxonomy format or using the one supplied in input_format. This is the primary conversion function; format2sintax(), format2dada2(), and format2dada2_species() are convenience wrappers around it.

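Supported input formats: the prefix-based formats "sintax", "unite", and "greengenes2", whose rank labels are detectable, and the positional "dada2" format, whose ranks are assigned by position (see input_format).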

Supported output formats:

"sintax" — VSEARCH/USEARCH SINTAX (>ID;tax=k:Kingdom,p:Phylum,...)
"unite" — UNITE default (>ID;k__Kingdom;p__Phylum;...)
"greengenes2" — Greengenes2 (>ID d__Domain;p__Phylum;...)
"dada2" — Unprefixed semicolon-delimited (>Kingdom;Phylum;...;)
"dada2_species" — For dada2::addSpecies() (>ID Genus Species)

Positional formats ("pr2", "dada2") can be detected by detect_tax_format() but carry no rank labels. "pr2" cannot be used as input for conversion; "dada2" can, with ranks assigned by position (see input_format).
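
To make the header layouts listed above concrete, here is a minimal base-R sketch of the parse-then-render idea for the prefix-based formats. The helpers parse_prefixed_header() and render_header() are hypothetical names introduced only for illustration; they are not part of the package and do not claim to reflect how format_fasta_db() is implemented. The "dada2_species" layout is left out.

```r
# Hypothetical helper (not part of the package): split a prefix-based header
# into its sequence ID and a named character vector of ranks.
parse_prefixed_header <- function(header) {
  if (grepl(";tax=", header, fixed = TRUE)) {
    # SINTAX layout: ID;tax=k:Kingdom,p:Phylum,...
    parts <- strsplit(header, ";tax=", fixed = TRUE)[[1]]
    id <- parts[1]
    fields <- strsplit(sub(";$", "", parts[2]), ",", fixed = TRUE)[[1]]
    sep <- ":"
  } else if (grepl("^\\S+\\s+[a-z]__", header)) {
    # Greengenes2 layout: ID d__Domain;p__Phylum;...
    id <- sub("\\s.*$", "", header)
    fields <- strsplit(sub("^\\S+\\s+", "", header), ";\\s*")[[1]]
    sep <- "__"
  } else {
    # UNITE layout: ID;k__Kingdom;p__Phylum;...
    fields <- strsplit(header, ";", fixed = TRUE)[[1]]
    id <- fields[1]
    fields <- fields[-1]
    sep <- "__"
  }
  fields <- fields[nzchar(fields)]
  # Split each field at the first occurrence of the rank separator.
  pos <- regexpr(sep, fields, fixed = TRUE)
  keys <- substr(fields, 1, pos - 1)
  values <- substr(fields, pos + nchar(sep), nchar(fields))
  list(id = id, ranks = setNames(values, keys))
}

# Hypothetical helper: write an ID and named ranks in one of the output layouts.
render_header <- function(id, ranks, output_format) {
  switch(output_format,
    sintax = paste0(id, ";tax=", paste0(names(ranks), ":", ranks, collapse = ",")),
    unite = paste0(id, ";", paste0(names(ranks), "__", ranks, collapse = ";")),
    greengenes2 = paste0(id, " ", paste0(names(ranks), "__", ranks, collapse = ";")),
    dada2 = paste0(paste(ranks, collapse = ";"), ";"),
    stop("unsupported output format: ", output_format)
  )
}

parsed <- parse_prefixed_header("AB123;k__Fungi;p__Ascomycota;c__Sordariomycetes")
render_header(parsed$id, parsed$ranks, "sintax")
```

The sketch also shows why rank labels matter: the parser recovers the rank of each field from its prefix, which a purely positional header does not provide.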

Usage

format_fasta_db(
  fasta_db = NULL,
  taxnames = NULL,
  output_format = c("sintax", "unite", "greengenes2", "dada2", "dada2_species"),
  input_format = "auto",
  output_path = NULL,
  id_prefix = "seq"
)

Arguments

fasta_db: (Character) Path to a FASTA file (plain or gzipped). Mutually exclusive with taxnames.
taxnames: (Character vector) Taxonomy header strings (without leading >). Mutually exclusive with fasta_db.
output_format: (Character) Target format. One of "sintax", "unite", "greengenes2", "dada2", "dada2_species".
input_format: (Character, default "auto") Input format. One of "auto" (auto-detect via detect_tax_format()), "sintax", "unite", "greengenes2", "dada2". Positional "dada2" input (taxonomy-only headers with no sequence ID) is assigned ranks by position (d, p, c, o, f, g, s); see id_prefix for the synthetic sequence IDs generated in that case.
output_path: (Character) If provided together with fasta_db, the reformatted FASTA is written to this path and the DNAStringSet is returned invisibly.
id_prefix: (Character, default "seq") Prefix used to build synthetic sequential sequence IDs (e.g. "seq000001") for input formats that carry no per-sequence identifier ("dada2"). Ignored when the input already provides IDs.
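
The handling of positional "dada2" input described under input_format and id_prefix can be illustrated with a short base-R sketch. The helper label_positional() is a hypothetical name, not a package function, and the six-digit zero padding is only inferred from the "seq000001" example; the package may build its IDs differently.

```r
# Rank order as given in the input_format description.
positional_ranks <- c("d", "p", "c", "o", "f", "g", "s")

# Hypothetical helper (not part of the package): assign rank labels by
# position and attach synthetic sequential IDs to each taxonomy string.
label_positional <- function(taxnames, id_prefix = "seq") {
  # Assumption: IDs are the prefix followed by a six-digit, zero-padded index.
  ids <- sprintf("%s%06d", id_prefix, seq_along(taxnames))
  ranks <- lapply(taxnames, function(x) {
    values <- strsplit(x, ";", fixed = TRUE)[[1]]
    values <- values[nzchar(values)]
    # More fields than known ranks cannot be labelled by position.
    stopifnot(length(values) <= length(positional_ranks))
    setNames(values, positional_ranks[seq_along(values)])
  })
  setNames(ranks, ids)
}

labelled <- label_positional(
  c("Bacteria;Pseudomonadota;", "Fungi;Ascomycota;Sordariomycetes;")
)
names(labelled)
labelled[["seq000001"]]
```

Because the labels come from position alone, a header that skips a rank would shift every later field to the wrong label, which is the main limitation of positional input.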

Value

If taxnames is used, a character vector of reformatted headers. If fasta_db is used, a DNAStringSet with reformatted names (invisibly when output_path is given).

Author

Adrien Taudière

Examples

# UNITE → SINTAX
format_fasta_db(
  taxnames = "AB123;k__Fungi;p__Ascomycota;c__Sordariomycetes",
  output_format = "sintax"
)
#> [1] "AB123;tax=k:Fungi,p:Ascomycota,c:Sordariomycetes"

# SINTAX → UNITE
format_fasta_db(
  taxnames = "AB123;tax=k:Fungi,p:Ascomycota,c:Sordariomycetes",
  output_format = "unite"
)
#> [1] "AB123;k__Fungi;p__Ascomycota;c__Sordariomycetes"

# Greengenes2 → dada2
format_fasta_db(
  taxnames = "abc123 d__Bacteria;p__Pseudomonadota;g__Escherichia",
  output_format = "dada2"
)
#> [1] "Bacteria;Pseudomonadota;Escherichia;"