Getting Started with comparpq

Introduction

comparpq is an extension package to {MiscMetabar} and {phyloseq} designed for comparing phyloseq objects. This package is particularly useful for:

- Evaluating taxonomic assignment accuracy using mock communities
- Creating interactive visualizations of microbiome data
- Comparing multiple taxonomic databases or assignment methods
- Data preprocessing and quality control

Installation

You can install comparpq from GitHub:

# Install from GitHub
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("adrientaudiere/comparpq")

Loading Required Libraries

library(comparpq)

# The package automatically loads dependencies including:
# - phyloseq for handling microbiome data
# - MiscMetabar for additional microbiome analysis tools

Working with Example Data

The package includes the Glom_otu dataset, which contains OTU data for glomalean (arbuscular mycorrhizal) fungi:

Glom_otu
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 1147 taxa and 444 samples ]
#> sample_data() Sample Data:       [ 444 samples by 118 sample variables ]
#> tax_table()   Taxonomy Table:    [ 1147 taxa by 14 taxonomic ranks ]
#> refseq()      DNAStringSet:      [ 1147 reference sequences ]

# Basic information about the dataset
cat("Number of taxa:", phyloseq::ntaxa(Glom_otu), "\n")
#> Number of taxa: 1147
cat("Number of samples:", phyloseq::nsamples(Glom_otu), "\n")
#> Number of samples: 444
cat("Taxonomic ranks:", paste(phyloseq::rank_names(Glom_otu), collapse = ", "), "\n")
#> Taxonomic ranks: Kingdom, Phyla, Class, Order, Family, Genus, Species, Kingdom__eukaryome_Glomero, Phylum__eukaryome_Glomero, Class__eukaryome_Glomero, Order__eukaryome_Glomero, Family__eukaryome_Glomero, Genus__eukaryome_Glomero, Species__eukaryome_Glomero
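
The rank names printed above appear to hold two sets of assignments side by side, the second marked by the suffix `__eukaryome_Glomero`. That reading is an assumption based on the names alone. The base R sketch below separates the two sets and checks whether the suffixed names line up with the plain ones:

```r
# Rank names copied from the output above
ranks <- c(
  "Kingdom", "Phyla", "Class", "Order", "Family", "Genus", "Species",
  "Kingdom__eukaryome_Glomero", "Phylum__eukaryome_Glomero",
  "Class__eukaryome_Glomero", "Order__eukaryome_Glomero",
  "Family__eukaryome_Glomero", "Genus__eukaryome_Glomero",
  "Species__eukaryome_Glomero"
)

# Assumption: the suffix marks a second set of assignments
suffix <- "__eukaryome_Glomero"
has_suffix <- endsWith(ranks, suffix)

plain_ranks <- ranks[!has_suffix]
# Strip the suffix so the two sets can be compared name by name
base_names <- sub(suffix, "", ranks[has_suffix], fixed = TRUE)

# Names present in one set but not the other
setdiff(base_names, plain_ranks)
#> [1] "Phylum"
setdiff(plain_ranks, base_names)
#> [1] "Phyla"
```

The only mismatch is `Phyla` versus `Phylum`, which matters if you pair the two sets of columns by name.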

Interactive Visualizations

Bubble Plots with Observable HQ

One of the key features of comparpq is the ability to create interactive bubble plots using Observable HQ:

# Basic bubble plot
bubbles_pq(Glom_otu)

# Bubble plot colored by Family
bubbles_pq(Glom_otu,
  rank_color = "Family",
  min_nb_seq = 1000
)

# Customized bubble plot with different color scheme
bubbles_pq(Glom_otu,
  rank_color = "Class",
  categorical_scheme = "d3.schemePastel1",
  min_nb_seq = 500,
  log1ptransform = TRUE
)
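
To give a feel for the two numeric options used above, here is a base R sketch on made-up counts. It assumes that `min_nb_seq` acts as a minimum total read count per taxon and that `log1ptransform` applies `log1p()` to the counts before sizing; neither behavior is stated in the text, so treat both as assumptions rather than a description of `bubbles_pq()`.

```r
# Hypothetical total read counts per taxon (illustrative values only)
reads <- c(taxon_a = 120, taxon_b = 800, taxon_c = 25000, taxon_d = 3)

# Assumption: taxa below the threshold are dropped before plotting
min_nb_seq <- 500
kept <- reads[reads >= min_nb_seq]

# log1p(x) computes log(1 + x): it is defined at 0 and compresses large
# counts, so one very abundant taxon does not dwarf all the others
bubble_size <- log1p(kept)

# Ratio between the largest and smallest kept taxon, before and after
max(kept) / min(kept)
max(bubble_size) / min(bubble_size)
```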

Traditional Plots

For non-interactive visualizations, you can use:

# Create comparison matrices for visualization
matrix_data <- tc_points_matrix(Glom_otu)

# Bar chart comparison (requires appropriate data structure)
# tc_bar(matrix_data)

# Circular comparison plot
# tc_circle(matrix_data)

Data Manipulation and Preprocessing

Working with Taxonomic Tables

# Select specific taxonomic ranks
Glom_otu_subset <- select_ranks_pq(Glom_otu, Kingdom, Genus)
cat("Original ranks:", paste(phyloseq::rank_names(Glom_otu), collapse = ", "), "\n")
#> Original ranks: Kingdom, Phyla, Class, Order, Family, Genus, Species, Kingdom__eukaryome_Glomero, Phylum__eukaryome_Glomero, Class__eukaryome_Glomero, Order__eukaryome_Glomero, Family__eukaryome_Glomero, Genus__eukaryome_Glomero, Species__eukaryome_Glomero
cat("Selected ranks:", paste(phyloseq::rank_names(Glom_otu_subset), collapse = ", "), "\n")
#> Selected ranks: Kingdom, Genus

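# Rename taxonomic ranks
# Note: this call supplies only new_names (six names for 14 ranks), and the
# output below shows the rank names unchanged. The quality-control example
# later in this document passes both old_names and new_names.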
renamed_physeq <- rename_ranks_pq(Glom_otu,
  new_names = c(
    "Domain", "Division", "Class_level",
    "Order_level", "Family_level", "Genus_level"
  )
)
cat("Renamed ranks:", paste(phyloseq::rank_names(renamed_physeq), collapse = ", "), "\n")
#> Renamed ranks: Kingdom, Phyla, Class, Order, Family, Genus, Species, Kingdom__eukaryome_Glomero, Phylum__eukaryome_Glomero, Class__eukaryome_Glomero, Order__eukaryome_Glomero, Family__eukaryome_Glomero, Genus__eukaryome_Glomero, Species__eukaryome_Glomero

Pattern Replacement and Data Cleaning

# Replace unwanted patterns with NA
# This is useful for cleaning taxonomic assignments
cleaned_physeq <- taxtab_replace_pattern_by_NA(Glom_otu,
  pattern = "^uncultured",
  taxonomic_ranks = "Genus"
)

# Check how many taxa were affected
original_genus <- phyloseq::tax_table(Glom_otu)[, "Genus"]
cleaned_genus <- phyloseq::tax_table(cleaned_physeq)[, "Genus"]
cat(
  "Uncultured genera replaced with NA:",
  sum(is.na(cleaned_genus)) - sum(is.na(original_genus)), "\n"
)
#> Uncultured genera replaced with NA: 0
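
The idea behind pattern replacement can be sketched in base R on a small character matrix. The helper `replace_pattern_by_na()` and the taxonomy values are hypothetical and only illustrate the technique; this is not the implementation of `taxtab_replace_pattern_by_NA()`.

```r
# Hypothetical taxonomy table as a character matrix (illustrative values only)
tax <- matrix(
  c(
    "Glomeraceae", "Glomus",
    "Glomeraceae", "uncultured Glomus",
    "unidentified", "Rhizophagus"
  ),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("otu1", "otu2", "otu3"), c("Family", "Genus"))
)

# Hypothetical helper: set every value matching `pattern` to NA,
# but only in the columns listed in `ranks`
replace_pattern_by_na <- function(tax, pattern, ranks) {
  for (rank in ranks) {
    hit <- grepl(pattern, tax[, rank])
    tax[hit, rank] <- NA_character_
  }
  tax
}

cleaned <- replace_pattern_by_na(tax, pattern = "^uncultured", ranks = "Genus")

# Number of Genus values newly set to NA
sum(is.na(cleaned[, "Genus"])) - sum(is.na(tax[, "Genus"]))
#> [1] 1
```

Because the pattern is anchored with `^`, only values that start with "uncultured" are replaced, and the Family column is left alone. Under this reading, a count of 0 such as the one reported above would mean that no Genus value begins with "uncultured".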

Mock Community Analysis

Creating Fake Taxa for Accuracy Assessment

To evaluate taxonomic assignment accuracy, you can add fake taxa to your dataset:

# Add shuffled sequences (creates fake taxa from existing sequences)
physeq_with_shuffled <- add_shuffle_seq_pq(Glom_otu, n_fake = 10)
cat("Original taxa count:", phyloseq::ntaxa(Glom_otu), "\n")
#> Original taxa count: 1147
cat("With fake taxa:", phyloseq::ntaxa(physeq_with_shuffled), "\n")
#> With fake taxa: 1157

# Add external sequences (simulates contamination or database gaps)
external_seq <- Biostrings::readDNAStringSet(
  system.file("extdata/ex_little.fasta", package = "MiscMetabar")
)
physeq_with_external <- add_external_seq_pq(Glom_otu, ext_seqs = external_seq)
cat("With external sequences added:", phyloseq::ntaxa(physeq_with_external), "\n")
#> With external sequences added: 1149
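
A shuffled sequence keeps the length and base composition of its source while destroying the order of the bases, which is what makes it a useful negative control. The base R sketch below shows one way to do this; the helper `shuffle_sequence()` and the example sequence are hypothetical, and whether `add_shuffle_seq_pq()` shuffles in exactly this way is an assumption.

```r
# Hypothetical helper: permute the bases of a single sequence
shuffle_sequence <- function(seq) {
  bases <- strsplit(seq, "", fixed = TRUE)[[1]]
  paste(sample(bases), collapse = "")
}

set.seed(42) # make the shuffle reproducible
original <- "ATGCGTACGTTAGCCTAGGA" # illustrative sequence, not from Glom_otu
shuffled <- shuffle_sequence(original)

# Same length and same base counts, different order
nchar(original) == nchar(shuffled)
table(strsplit(original, "")[[1]])
table(strsplit(shuffled, "")[[1]])
```

A classifier that confidently assigns such a sequence to a genus or species is likely over-assigning.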

Accuracy Metrics Computation

# Example of computing accuracy metrics
# This requires properly formatted comparison data

# Step 1: Prepare your comparison data structure
# ranks_df should contain your taxonomic assignments to evaluate
# true_values_df should contain the known correct assignments

# Example structure (replace with your actual data):
# ranks_df <- data.frame(
#   method1 = c("Fungi", "Ascomycota", "Sordariomycetes"),
#   method2 = c("Fungi", "Basidiomycota", "Agaricomycetes")
# )
#
# true_values_df <- data.frame(
#   Kingdom = c("Fungi", "Fungi"),
#   Phylum = c("Glomeromycota", "Glomeromycota")
# )

# Step 2: Compute metrics
# accuracy_results <- tc_metrics_mock(physeq_with_shuffled,
#                                   ranks_df = ranks_df,
#                                   true_values_df = true_values_df,
#                                   fake_taxa = TRUE)
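
To make the notion of accuracy metrics concrete, here is a base R sketch that scores one rank against known values. The data and the definitions of true/false positives and negatives are assumptions chosen for illustration; `tc_metrics_mock()` may define and report them differently.

```r
# Hypothetical genus-level data (illustrative values only).
# NA in `truth` marks a fake taxon that should stay unassigned;
# NA in `pred` marks a taxon the method left unassigned.
truth <- c("Glomus", "Glomus", "Rhizophagus", "Funneliformis", NA, NA)
pred <- c("Glomus", "Rhizophagus", "Rhizophagus", NA, "Glomus", NA)

assigned <- !is.na(pred)
known <- !is.na(truth)
correct <- assigned & known & pred == truth

tp <- sum(correct)             # assigned and right
fp <- sum(assigned & !correct) # assigned but wrong, or a fake taxon assigned
fn <- sum(!assigned & known)   # real taxon left unassigned
tn <- sum(!assigned & !known)  # fake taxon correctly left unassigned

metrics <- c(
  precision = tp / (tp + fp),
  recall = tp / (tp + fn),
  accuracy = (tp + tn) / length(truth)
)
round(metrics, 2)
#> precision    recall  accuracy 
#>      0.50      0.67      0.50 
```

Fake taxa are what make true negatives and one kind of false positive observable: without them, a method that assigns everything is never penalized for over-assignment.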

Visualization of NA Patterns

The package includes a specialized visualization for understanding missing taxonomic data:

# Visualize NA patterns in taxonomic assignments
rainplot_taxo_na(Glom_otu)
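
The numbers behind such a plot can be approximated with a quick per-rank count of missing values. The sketch below uses a hypothetical taxonomy matrix and base R only; it is a rough numeric companion to the plot, not a description of what `rainplot_taxo_na()` computes.

```r
# Hypothetical taxonomy table (illustrative values only)
tax <- matrix(
  c(
    "Glomeraceae", "Glomus", "Glomus sp1",
    "Glomeraceae", "Glomus", NA,
    "Glomeraceae", NA, NA,
    NA, NA, NA
  ),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("otu", 1:4), c("Family", "Genus", "Species"))
)

# Count and proportion of unassigned taxa at each rank
na_per_rank <- colSums(is.na(tax))
na_fraction <- colMeans(is.na(tax))

na_per_rank
#>  Family   Genus Species 
#>       1       2       3 
na_fraction
#>  Family   Genus Species 
#>    0.25    0.50    0.75 
```

With a phyloseq object, the same two calls could be applied to the character matrix extracted from its taxonomy table.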

Resolving Taxonomic Conflicts

When working with multiple taxonomic databases, conflicts may arise:

# Resolve taxonomic conflicts between different assignment methods
# This function helps when you have conflicting taxonomic assignments
resolved_physeq <- resolve_taxo_conflict(Glom_otu,
  conflict_rank = "Genus",
  resolve_method = "majority"
)
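
A majority rule can be illustrated in base R on a small table of competing assignments. The helper `majority_vote()`, the column names, and the tie-handling choice (return NA) are all assumptions made for this sketch; they are not taken from `resolve_taxo_conflict()`.

```r
# Hypothetical genus assignments from three methods (illustrative values only)
assignments <- rbind(
  otu1 = c("Glomus", "Glomus", "Rhizophagus"),
  otu2 = c("Glomus", "Rhizophagus", NA),
  otu3 = c(NA, NA, NA)
)
colnames(assignments) <- c("method_a", "method_b", "method_c")

# Hypothetical helper: most frequent non-NA value, or NA on a tie
# or when no method made an assignment
majority_vote <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(NA_character_)
  }
  counts <- table(x)
  top <- names(counts)[counts == max(counts)]
  if (length(top) > 1) NA_character_ else top
}

consensus <- apply(assignments, 1, majority_vote)
consensus
#>     otu1     otu2     otu3 
#> "Glomus"       NA       NA 
```

Returning NA on a tie is the conservative choice: it avoids picking a winner arbitrarily when the methods disagree evenly.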

Advanced Usage Tips

Workflow for Method Comparison

1. Prepare your data: Start with a phyloseq object containing your OTU/ASV data
2. Add fake taxa: Use add_shuffle_seq_pq() and add_external_seq_pq() to create test cases
3. Perform taxonomic assignments: Apply different methods/databases to your data
4. Compare results: Use tc_metrics_mock() to quantify accuracy
5. Visualize: Create plots using bubbles_pq(), tc_bar(), or tc_circle()

Data Quality Control

# Example quality control workflow
qc_physeq <- Glom_otu

# 1. Clean up problematic annotations
qc_physeq <- taxtab_replace_pattern_by_NA(qc_physeq,
  pattern = "unknown|unidentified|uncultured",
  taxonomic_ranks = c("Family", "Genus")
)

# 2. Select relevant taxonomic ranks
qc_physeq <- select_ranks_pq(qc_physeq,
  taxonomic_ranks = c("Kingdom", "Phyla", "Class", "Order", "Family")
)

# 3. Standardize rank names
qc_physeq <- rename_ranks_pq(qc_physeq,
  old_names = c("Kingdom", "Phyla", "Class", "Order", "Family"),
  new_names = c("Kingdom__", "Phyla__", "Class__", "Order__", "Family__")
)

cat("Quality control completed. Final dataset has", phyloseq::ntaxa(qc_physeq), "taxa\n")
#> Quality control completed. Final dataset has 1147 taxa

Getting Help

- Function documentation: Use ?function_name for detailed help on any function
- Package website: https://adrientaudiere.github.io/comparpq/
- Report issues: https://github.com/adrientaudiere/comparpq/issues
- Related packages:
  - {MiscMetabar}: Extended microbiome analysis
  - {phyloseq}: Core microbiome data handling

Summary

The comparpq package provides a comprehensive toolkit for:

- Interactive visualization of phyloseq objects using modern web technologies
- Accuracy assessment of taxonomic assignment methods using mock communities
- Data preprocessing and quality control for microbiome analyses
- Comparative analysis between different taxonomic databases or assignment approaches

This makes it particularly valuable for researchers working on taxonomic assignment method development, database comparison, and microbiome data quality assessment.

Adrien Taudière

2026-03-13