Only tested with Unite and Eukaryome fasta file for the moment. Rely on the presence of the pattern pattern_tax default "k__" to format the header.
A reference database in sintax format contain taxonomic information in the header of each sequence in the form of a string starting with ";tax=" and followed by a comma-separated list of up to nine taxonomic identifiers. Each taxonomic identifier must start with an indication of the rank by one of the letters d (for domain) k (kingdom), p (phylum), c (class), o (order), f (family), g (genus), s (species), or t (strain). The letter is followed by a colon (:) and the name of that rank. Commas and semicolons are not allowed in the name of the rank. Non-ascii characters should be avoided in the names.
Example:
\>X80725_S000004313;tax=d:Bacteria,p:Proteobacteria,c:Gammaproteobacteria,o:Enterobacteriales,f:Enterobacteriaceae,g:Escherichia/Shigella,s:Escherichia_coli,t:str._K-12_substr._MG1655
Usage
format2sintax(
fasta_db = NULL,
taxnames = NULL,
pattern_tax = "k__",
pattern_sintax = "tax=k:",
output_path = NULL
)
Arguments
- fasta_db
A link to a fasta files
- taxnames
A list of names to format. You must specify either fasta_db OR taxnames, not both.
- pattern_tax
(default "k__") The pattern to replace by pattern_sintax.
- pattern_sintax
(default "tax=k:") Useless for most users. Sometimes you may want to replacte by "tax=d:" (d for domain instead of kingdom).
- output_path
(optional) A path to an output fasta files. Only used if fasta_db is set.