This function creates synthetic chimeric sequences by combining parts of
existing sequences from a phyloseq object. It is useful for benchmarking
chimera detection methods such as MiscMetabar::chimera_removal_vs() or
chimera_removal_dada2().
Usage
create_chimera_pq(
  physeq,
  n_chimeras = 5,
  prop_mean = 0.5,
  prop_sd = 0.15,
  prop_min = 0.1,
  seed = 123,
  median_abundance_multiplier = 0.1,
  ensure_distinct_parents = TRUE,
  min_parent_distance = 0.1
)
Arguments
- physeq
(phyloseq, required) A phyloseq object with a refseq slot containing DNA sequences.
- n_chimeras
(integer, default: 5) Number of chimeric sequences to create.
- prop_mean
(numeric, default: 0.5) Mean of the normal distribution used to sample the proportion of the first parent sequence. A value of 0.5 means chimeras will be centered around 50/50 splits.
- prop_sd
(numeric, default: 0.15) Standard deviation of the normal distribution used to sample proportions. Higher values create more variable chimera breakpoints.
- prop_min
(numeric, default: 0.1) Minimum proportion threshold. Proportions below this value (or above 1 - prop_min) are resampled to ensure each parent contributes meaningfully to the chimera.
- seed
(integer, default: 123) Random seed for reproducibility.
- median_abundance_multiplier
(numeric, default: 0.1) Multiplier to set the abundance of chimeric sequences relative to the median abundance of existing sequences. A value of 0.1 means chimeras will have approximately 10% of the median abundance.
- ensure_distinct_parents
(logical, default: TRUE) If TRUE, ensures that parent2 is sufficiently different from parent1 based on min_parent_distance. Chimeras created from very similar parent sequences may be undetectable.
- min_parent_distance
(numeric, default: 0.1) Minimum sequence distance (proportion of differing positions) between parent1 and parent2. Only used when ensure_distinct_parents = TRUE. Values typically range from 0.05 (5% divergence) to 0.3 (30% divergence).
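The way prop_mean, prop_sd, prop_min and min_parent_distance interact can be illustrated with a minimal base R sketch. The helpers sample_prop(), seq_distance() and make_chimera() are hypothetical names introduced here for illustration; they are not part of MiscMetabar, and the function's internal implementation may differ (for example in how the breakpoint is rounded or how distances are computed for sequences of unequal length).

```r
# Hypothetical helpers, written for illustration only; they are not part of
# MiscMetabar and may differ from what create_chimera_pq() does internally.

# Draw one proportion from N(prop_mean, prop_sd), redrawing until it lies
# within [prop_min, 1 - prop_min].
sample_prop <- function(prop_mean = 0.5, prop_sd = 0.15, prop_min = 0.1) {
  repeat {
    p <- rnorm(1, mean = prop_mean, sd = prop_sd)
    if (p >= prop_min && p <= 1 - prop_min) {
      return(p)
    }
  }
}

# Proportion of differing positions over the shared length of two sequences
# (no alignment is performed, which is a simplifying assumption).
seq_distance <- function(seq1, seq2) {
  n <- min(nchar(seq1), nchar(seq2))
  a <- strsplit(substr(seq1, 1, n), "")[[1]]
  b <- strsplit(substr(seq2, 1, n), "")[[1]]
  mean(a != b)
}

# Join the start of parent1 to the end of parent2 at the breakpoint.
make_chimera <- function(parent1, parent2, prop) {
  breakpoint <- round(nchar(parent1) * prop)
  paste0(
    substr(parent1, 1, breakpoint),
    substr(parent2, breakpoint + 1, nchar(parent2))
  )
}

set.seed(123)
parent1 <- "AAAAAAAAAAAAAAAAAAAA"
parent2 <- "CCCCCCCCCCCCCCCCCCCC"

prop <- sample_prop()
chimera <- make_chimera(parent1, parent2, prop)
parent_dist <- seq_distance(parent1, parent2)
```

In this sketch, a candidate parent2 would be accepted only if seq_distance(parent1, parent2) is at least min_parent_distance, which mirrors the behaviour described for ensure_distinct_parents = TRUE.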
Value
A list containing:
- physeq
The new phyloseq object with added chimeric sequences
- chimera_names
Character vector of chimera taxa names
- parent_info
Data frame with details about each chimera: chimera name, parent1, parent2, parent_distance, prop_parent1, breakpoint, seq_length
- params
List of parameters used (prop_mean, prop_sd, prop_min, ensure_distinct_parents, min_parent_distance)
Examples
if (FALSE) { # \dontrun{
library(MiscMetabar)
data(data_fungi)

# Default: centered around 50% with some variation
result <- create_chimera_pq(data_fungi, n_chimeras = 40)
data_fungi_test <- result$physeq
known_chimeras <- result$chimera_names

# View the parent information and proportions
print(result$parent_info)

# More variable proportions (wider distribution)
result2 <- create_chimera_pq(data_fungi, n_chimeras = 40,
  prop_mean = 0.5, prop_sd = 0.25)

# Biased toward more of parent1 (e.g., 70/30 splits on average)
result3 <- create_chimera_pq(data_fungi, n_chimeras = 40,
  prop_mean = 0.7, prop_sd = 0.1)

# Benchmark chimera detection methods
if (MiscMetabar::is_vsearch_installed()) {
  nochim_vs <- MiscMetabar::chimera_removal_vs(data_fungi_test)
  detected_vs <- known_chimeras[!known_chimeras %in% phyloseq::taxa_names(nochim_vs)]
  cat("vsearch detected:", length(detected_vs), "/",
    length(known_chimeras), "chimeras\n")
}

# Visualize the distribution of proportions
hist(result$parent_info$prop_parent1,
  main = "Distribution of parent1 proportions",
  xlab = "Proportion from parent1", xlim = c(0, 1))

# Ensure parents are at least 15% different (more detectable chimeras)
result4 <- create_chimera_pq(data_fungi, n_chimeras = 40,
  min_parent_distance = 0.15)

# Disable parent distance filtering (allows similar parents)
result5 <- create_chimera_pq(data_fungi, n_chimeras = 40,
  ensure_distinct_parents = FALSE)
} # }
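The benchmarking step in the examples only counts how many synthetic chimeras were removed. A fuller summary might also report which taxa were removed without being synthetic chimeras. The sketch below uses made-up taxa names and a made-up removal outcome purely for illustration; in a real run, all_taxa and kept_taxa would presumably come from phyloseq::taxa_names() applied to the object before and after chimera removal.

```r
# Hypothetical taxa names and removal outcome, for illustration only.
all_taxa <- c("ASV1", "ASV2", "ASV3", "ASV4", "chim1", "chim2", "chim3")
known_chimeras <- c("chim1", "chim2", "chim3")
# Taxa remaining after a chimera-removal step (made-up outcome).
kept_taxa <- c("ASV1", "ASV2", "ASV4", "chim3")

removed <- setdiff(all_taxa, kept_taxa)
true_pos <- intersect(removed, known_chimeras)  # synthetic chimeras removed
other_removed <- setdiff(removed, known_chimeras)  # non-synthetic taxa removed
missed <- intersect(kept_taxa, known_chimeras)  # synthetic chimeras retained

# Fraction of synthetic chimeras that the method removed.
sensitivity <- length(true_pos) / length(known_chimeras)
```

Taxa in other_removed are not necessarily false positives: the original dataset may already contain genuine chimeras, so this count is best read as an upper bound on over-removal.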