Use Cases¶

Real-world applications of AlleleFlux for metagenomic evolution studies.

Antibiotic Resistance Evolution¶

Question: How do bacterial communities evolve under antibiotic treatment?

Design: Longitudinal fecal samples from antibiotic-treated vs. control mice (pre, during, post-treatment). 10 treated, 8 control mice with 3 samples each = 54 total samples. To analyze with AlleleFlux, we compare two timepoints at a time using separate configurations: pre→during and during→post.

Complete Configuration (pre→during comparison):

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/sample_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/antibiotic_resistance/pre_during/

quality_control:
  breadth_threshold: 0.2           # Require 20% genome coverage
  min_sample_num: 6                # At least 6 samples per group-timepoint

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre, during]     # Compare pre → during treatment
      focus: during                # Focus on during-treatment state
  groups_combinations:
    - [antibiotic, control]
  use_lmm: true
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 16
    mem_mb: 16000
    time_min: 120
Additional Configuration (during→post comparison):

For the post-treatment phase, create a second config file with:

output:
  root_dir: output/antibiotic_resistance/during_post/
analysis:
  timepoints_combinations:
    - timepoint: [during, post]    # Compare during → post treatment
      focus: post                  # Focus on post-treatment state
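
Rather than editing the second file by hand, it can be derived from the first with sed. The following is a minimal sketch, not part of AlleleFlux: it writes a toy stand-in config (only the keys that change between phases) to a temporary directory so that it runs on its own. The file names follow the commands used in this section, and the substitutions assume the key spellings and paths shown above.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-in for the full pre→during config (only the keys that change).
cat > config_antibiotic_pre_during.yml <<'EOF'
output:
  root_dir: output/antibiotic_resistance/pre_during/
analysis:
  timepoints_combinations:
    - timepoint: [pre, during]
      focus: during
EOF

# Rewrite the output directory, the timepoint pair, and the focus timepoint.
sed -e 's#/pre_during/#/during_post/#' \
    -e 's/timepoint: \[pre, during\]/timepoint: [during, post]/' \
    -e 's/focus: during/focus: post/' \
    config_antibiotic_pre_during.yml > config_antibiotic_during_post.yml

cat config_antibiotic_during_post.yml
```

Applied to the real config, the same three substitutions leave every other key untouched, so both phases share identical QC and statistics settings.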

Expected command:

# Phase 1: Early response to antibiotic
alleleflux run --config config_antibiotic_pre_during.yml --threads 16

# Phase 2: Late response and stabilization
alleleflux run --config config_antibiotic_during_post.yml --threads 16

Output files to examine:

pre_during/step1_output/eligibility_tables/eligibility_table_pre_during-antibiotic_control.tsv - Which MAGs pass QC
pre_during/step2_output/scores/mag_level_scores.tsv - Parallelism scores during initial treatment response
during_post/step2_output/scores/mag_level_scores.tsv - Parallelism scores during stabilization phase
Combined gene-level scores identifying resistance candidates across both phases
Combined statistical results controlling for mouse identity

Result interpretation:

Early response phase (pre→during): MAGs with parallelism scores above roughly 2-3% suggest rapid adaptation to antibiotic stress. Scores in this phase are often, though not always, lower than in the late phase
Late stabilization phase (during→post): Scores may increase further if selection continues, or plateau if community has stabilized
Divergence between phases: Compare scores to identify which genes are selected early (rapid response) vs. late (fine-tuning). Genes present in both phases indicate core resistance mechanisms
Cross-group comparison: Compare antibiotic group evolution to control group (expect <2% parallelism in controls for both phases)
Gene-level outliers: Look for genes in CARD database (antibiotic resistance genes). Unknown genes are candidates for novel resistance mechanisms
CMH significance: p-values < 0.05 indicate consistent allele changes across individual mice, which makes within-mouse noise an unlikely explanation
Example interpretation: If a Bacteroides MAG shows a parallelism score of 3% in pre→during (treated group) with early outliers including tetR, and then 6% in during→post with additional outliers in a porin gene, this suggests: (1) early selection on direct resistance mechanisms, (2) late selection for reduced drug uptake through altered membrane permeability
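
The cross-group comparison above can be sketched with awk. This is an illustration on toy data only: the column layout (mag_id, group, score_pct) and the 2% cutoffs are assumptions chosen to mirror the thresholds in this list, and the real mag_level_scores.tsv may be organized differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

scores=$(mktemp)
# Toy long-format table: mag_id, group, score_pct (hypothetical columns).
printf '%s\t%s\t%s\n' \
  mag_id  group      score_pct \
  MAG_001 antibiotic 3.4 \
  MAG_001 control    0.8 \
  MAG_002 antibiotic 1.1 \
  MAG_002 control    1.5 \
  MAG_003 antibiotic 6.2 \
  MAG_003 control    2.6 > "$scores"

# Keep MAGs scoring above 2% in the treated group and below 2% in controls.
candidates=$(awk -F'\t' -v min_treated=2 -v max_control=2 '
  NR == 1 { next }                              # skip header
  $2 == "antibiotic" { treated[$1] = $3 }
  $2 == "control"    { control[$1] = $3 }
  END {
    for (mag in treated) {
      if (treated[mag] + 0 > min_treated && (mag in control) && control[mag] + 0 < max_control) {
        print mag
      }
    }
  }' "$scores" | LC_ALL=C sort)

echo "$candidates"
```

In this toy table only MAG_001 passes: MAG_002 is below the treated cutoff, and MAG_003 also scores high in controls, which points to a change unrelated to treatment.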

Diet-Microbiome Adaptation¶

Question: How do microbiomes adapt to dietary changes?

Design: Two diet groups (high-fat vs. standard diet), 12 mice per group, sampled at baseline, week 1, and week 4 = 72 total samples

Complete Configuration:

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/diet_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/diet_adaptation/baseline_week1/

quality_control:
  breadth_threshold: 0.1           # Lower threshold (may have diet-dependent coverage)
  min_sample_num: 4                # 12 mice per group; require at least 4 samples per group-timepoint

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [baseline, week1]
      focus: week1                 # Focus on initial dietary response
  groups_combinations:
    - [highfat, standard]
  use_lmm: true
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 12
    mem_mb: 12000
    time_min: 90
  statistical_tests:
    cpus: 12
    mem_mb: 12000
    time_min: 90

Expected command:

alleleflux run --config config_diet.yml --threads 12

Additional Configuration (week1→week4 comparison):

For the longer-term adaptation trajectory, create a second config file with:

output:
  root_dir: output/diet_adaptation/week1_week4/

analysis:
  timepoints_combinations:
    - timepoint: [week1, week4]    # Compare week1 → week4 adaptation
      focus: week4                 # Focus on steady-state adaptations

Output files to examine:

Phase 1: baseline_week1/step1_output/eligibility_tables/eligibility_table_baseline_week1-highfat_standard.tsv
Phase 2: week1_week4/step1_output/eligibility_tables/eligibility_table_week1_week4-highfat_standard.tsv
Phase 1 scores: baseline_week1/step2_output/scores/mag_level_scores.tsv - Early dietary response
Phase 2 scores: week1_week4/step2_output/scores/mag_level_scores.tsv - Stabilization phase
step2_output/scores/gene_scores.tsv - Metabolic genes under selection
step2_output/scores/taxa_scores.tsv - Phylum/Family level aggregation for dietary responders
step2_output/outliers/outlier_genes.tsv - Genes under selection, mapped to KEGG pathways

Result interpretation:

Two-phase adaptation: Phase 1 (baseline→week1) shows rapid initial response; Phase 2 (week1→week4) shows slower stabilization
Metabolic focus: Filter gene_scores.tsv for genes annotated with: carbohydrate transport, lipid metabolism, short-chain fatty acid synthesis
Taxa-level patterns: In high-fat group, expect high scores for Bacteroidetes (fiber fermenters) and Faecalibacterium (SCFA producers). Standard diet shows stable low scores in both phases
Phase comparison: Scores may increase, plateau, or decrease from phase 1 to phase 2 depending on adaptation kinetics
Example interpretation: A Roseburia MAG may show 3% parallelism in phase 1 (baseline→week1) with early outliers in carbohydrate transport, then 5% in phase 2 (week1→week4) with additional outliers in propionate synthesis genes. This would suggest an initial shift in substrate uptake followed by metabolic fine-tuning
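
The phase comparison can be made concrete by joining the two per-phase score tables on MAG ID and reporting the change. The sketch below uses toy two-column tables (mag_id, score_pct); these column names are assumptions, and the real mag_level_scores.tsv files would need the matching columns selected first.

```shell
#!/usr/bin/env bash
set -euo pipefail

phase1=$(mktemp)
phase2=$(mktemp)
# Toy per-phase tables with hypothetical columns: mag_id, score_pct.
printf '%s\t%s\n' mag_id score_pct MAG_A 3.0 MAG_B 1.2 > "$phase1"
printf '%s\t%s\n' mag_id score_pct MAG_A 5.0 MAG_B 1.0 MAG_C 2.0 > "$phase2"

# For MAGs present in both phases, print: mag_id, phase 1, phase 2, change.
delta=$(awk -F'\t' '
  FNR == 1 { next }                     # skip the header of each file
  NR == FNR { p1[$1] = $2; next }       # first file: remember phase 1 scores
  ($1 in p1) { printf "%s\t%.1f\t%.1f\t%+.1f\n", $1, p1[$1], $2, $2 - p1[$1] }
' "$phase1" "$phase2")

echo "$delta"
```

MAGs that pass QC in only one phase (MAG_C here) are dropped by the join, so they should be inspected separately rather than read as a score of zero.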

Host-Microbe Co-evolution¶

Question: How do host genotypes shape microbial evolution?

Design: Longitudinal samples from WT vs. knockout mice, 10 mice per group, 3 timepoints = 60 total samples

Complete Configuration:

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/host_genotype_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/host_microbe/week0_week4/

quality_control:
  breadth_threshold: 0.15
  min_sample_num: 5                # 10 mice per group

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [week0, week4]
      focus: week4                 # Compare early → mid-stage genotype effects
  groups_combinations:
    - [wildtype, knockout]
  use_lmm: true                    # Critical: captures mouse-level random effects
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 16
    mem_mb: 16000
    time_min: 120

Expected command:

alleleflux run --config config_host.yml --threads 16

Additional Configuration (week4→week8 comparison):

For the longer-term co-evolution analysis, create a second config file with:

output:
  root_dir: output/host_microbe/week4_week8/

analysis:
  timepoints_combinations:
    - timepoint: [week4, week8]
      focus: week8                 # Compare mid-stage → late-stage genotype effects

Output files to examine:

Phase 1: week0_week4/step2_output/scores/mag_level_scores.tsv - Early genotype-dependent selection
Phase 2: week4_week8/step2_output/scores/mag_level_scores.tsv - Late genotype-dependent selection
Phase 1 LMM: week0_week4/step2_output/statistical_tests/lmm_results.tsv - Test genotype effect early
Phase 2 LMM: week4_week8/step2_output/statistical_tests/lmm_results.tsv - Test genotype effect late
step2_output/outliers/outlier_genes.tsv - Genotype-dependent adaptive genes
step2_output/scores/gene_scores.tsv - Filtered for surface proteins, secretion systems

Result interpretation:

Genotype-specific MAGs: MAGs with high scores in KO but low/zero in WT indicate host-genotype-dependent selection
LMM significance: p-values < 0.05 in LMM show genotype effect while controlling for individual mouse variation
Gene annotation: Focus outliers on: outer membrane proteins, secretion systems (T6SS, Sec), immune-related factors
Convergent evolution: If multiple MAGs show similar outliers (e.g., same flagellar genes), this suggests host-driven convergence
Temporal dynamics: Compare phase 1 (week0→week4) vs. phase 2 (week4→week8) to reveal whether selection is rapid early or gradual throughout
Example interpretation: If Bacteroides MAG shows parallelism 0% in WT but 4% in IL-10KO during phase 1, then 0% in WT but 7% in IL-10KO during phase 2, with outliers in flagellar genes and mucus-degrading glycosidases in both phases, this indicates sustained and strengthening IL-10-dependent selection for motile mucus-colonizers

Environmental Adaptation¶

Question: How do communities adapt to pollution?

Design: Contaminated vs. pristine soil sites, sampled at month 0, 3, 6. Multiple sites per treatment = 30+ samples

Complete Configuration:

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/environmental_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/environmental/month0_month3/

quality_control:
  breadth_threshold: 0.1           # Environmental samples often have lower coverage
  min_sample_num: 3                # 4-5 replicates per group-timepoint

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [month0, month3]
      focus: month3
  groups_combinations:
    - [contaminated, pristine]
  use_lmm: false                   # Site effects complex; pair-wise comparisons instead
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 12
    mem_mb: 12000
    time_min: 90

Expected command:

alleleflux run --config config_env.yml --threads 8

Additional Configuration (month3→month6 comparison):

For the longer-term environmental adaptation analysis, create a second config file with:

output:
  root_dir: output/environmental/month3_month6/

analysis:
  timepoints_combinations:
    - timepoint: [month3, month6]
      focus: month6                # Track sustained adaptation to contamination

Output files to examine:

Phase 1: month0_month3/step2_output/scores/mag_level_scores.tsv - Initial contamination response
Phase 2: month3_month6/step2_output/scores/mag_level_scores.tsv - Sustained adaptation or stabilization
Phase 1 CMH: month0_month3/step2_output/statistical_tests/cmh_results.tsv - Parallel response across sites
Phase 2 CMH: month3_month6/step2_output/statistical_tests/cmh_results.tsv - Sustained parallel effects
step2_output/outliers/outlier_genes.tsv - Pollutant-specific genes (e.g., heavy metal resistance, hydrocarbon degradation)
step2_output/scores/gene_scores.tsv - Pathway annotation for functional interpretation

Result interpretation:

Biphasic adaptation: Phase 1 (month0→month3) shows rapid initial response; Phase 2 (month3→month6) tracks sustained or enhanced adaptation
CMH significance: p < 0.05 supports parallel evolution across contaminated replicates in each phase rather than chance fluctuation
Temporal progression: Contaminated site scores may increase across both phases; pristine site scores remain low and flat in both
Gene-level biomarkers: Outliers encoding heavy metal efflux (CzcA, CopA), hydrocarbon degradation (alkane hydroxylase, cytochrome P450), or xenobiotic pathways are credible pollutant-responsive genes
Functional categories: Use KEGG pathway annotation to identify complete degradation operons under selection
Example interpretation: An Arthrobacter MAG shows 5% parallelism at the contaminated sites during phase 1 (month0→month3), with early outliers for mercury resistance (merA). During phase 2 (month3→month6), parallelism increases to 9.5%, with additional outliers for arsenic efflux (arsB) and PCB degradation. This indicates multi-stage adaptation: an initial mercury response followed by broader multi-contaminant tolerance. Outliers that are absent from pristine-site samples in both phases support pollution-driven selection

Fecal Microbiota Transplant (FMT) Study¶

Question: How does the donor microbiome adapt and stabilize after transplant into recipients?

Design: 15 FMT recipients sampled pre-FMT, day 1, week 1, month 1, month 3 post-FMT. Include 5 donor samples for baseline = 80 total samples. Track whether donor-derived taxa establish and whether they evolve to match recipient genetics.

Complete Configuration:

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/fmt_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/fmt_adaptation/pre_fmt_month3_post/

quality_control:
  breadth_threshold: 0.15          # Clinical samples vary; balance coverage vs. MAG count
  min_sample_num: 5                # ~15 recipients give ample replicates

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre_fmt, month3_post]
      focus: month3_post           # Final steady state
  groups_combinations:
    - [recipient, donor]           # Compare recipient vs. donor evolution
  use_lmm: true                    # Critical: each recipient is unique host environment
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 12
    mem_mb: 16000
    time_min: 120
  statistical_tests:
    cpus: 16
    mem_mb: 20000
    time_min: 180

Expected command:

alleleflux run --config config_fmt.yml --threads 16

Fine-grained temporal tracking:

For detailed temporal resolution of early establishment and late adaptation phases, create additional config files:

# Early phase: initial colonization shock (first additional config file)
output:
  root_dir: output/fmt_adaptation/pre_fmt_day1_post/
analysis:
  timepoints_combinations:
    - timepoint: [pre_fmt, day1_post]
      focus: day1_post

# Mid phase: establishment (second additional config file)
output:
  root_dir: output/fmt_adaptation/day1_post_week1_post/
analysis:
  timepoints_combinations:
    - timepoint: [day1_post, week1_post]
      focus: week1_post

# Late phase: stabilization (third additional config file)
output:
  root_dir: output/fmt_adaptation/week1_post_month3_post/
analysis:
  timepoints_combinations:
    - timepoint: [week1_post, month3_post]
      focus: month3_post
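
Because each phase differs only in its timepoint pair and output directory, the per-phase fragments can be generated from an ordered list of timepoints. The sketch below is one way to do that; the config_fmt_*.yml file names are hypothetical, and each generated file holds only the phase-specific keys, which still need to be combined with the shared input, quality_control, statistics, and resources sections.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Ordered timepoints used for the fine-grained FMT phases.
timepoints=(pre_fmt day1_post week1_post month3_post)

# One fragment per consecutive pair: from → to, with focus on the later one.
for ((i = 0; i < ${#timepoints[@]} - 1; i++)); do
  from=${timepoints[i]}
  to=${timepoints[i+1]}
  out="config_fmt_${from}_${to}.yml"   # hypothetical file name
  cat > "$out" <<EOF
output:
  root_dir: output/fmt_adaptation/${from}_${to}/
analysis:
  timepoints_combinations:
    - timepoint: [${from}, ${to}]
      focus: ${to}
EOF
  echo "wrote $out"
done
```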

Output files to examine:

Long-term: pre_fmt_month3_post/step2_output/scores/mag_level_scores.tsv - Overall recipient vs donor trajectory
Early phase: pre_fmt_day1_post/step2_output/scores/mag_level_scores.tsv - Colonization shock signature
Establishment: day1_post_week1_post/step2_output/scores/mag_level_scores.tsv - Early stabilization
Late phase: week1_post_month3_post/step2_output/scores/mag_level_scores.tsv - Final adaptation
All phases: step2_output/statistical_tests/lmm_results.tsv - Recipient-level variation
All phases: step2_output/outliers/outlier_genes.tsv - Recipient-specific adaptations (phase-dependent)

Result interpretation:

Bifurcation in scores: Long-term analysis (pre_fmt→month3_post) should show donor group scores stable ~0-2% across all timepoints; recipient group scores increase from ~1% (pre-FMT) to 3-8% by month3_post
Temporal kinetics across phases:
- Early phase (pre→day1): Colonization shock; large allele frequency swings as stressed donor cells meet new environment; weak but detectable parallelism (~1-2%)
- Establishment phase (day1→week1): Intermediate dynamics; scores increase toward stabilization plateau
- Late phase (week1→month3): Plateau phase; scores stabilize as donor microbiota equilibrates; CMH p-values strengthen (p < 0.01) indicating coordinated adaptation across recipients
Gene-level biomarkers vary by phase:
- Early (day1_post): High allele frequency variance; genes involved in stress response, adhesion initiation
- Late (month3_post): Stabilized genes in colonization (adhesins, mucin-binding), nutrient scavenging (vitamin synthesis, carbohydrate transport), immune evasion (flagellar reduction, LPS modification)
Cross-recipient parallelism: High CMH significance in late phase indicates similar donor lineages undergo similar selective pressures in different recipients, revealing universal recipient-environment constraints
Example interpretation: A donor-derived Faecalibacterium MAG shows 0.5% parallelism at pre-FMT in the long-term view, while the phase-wise analyses show an incremental signal: 1% for pre→day1, 2% for day1→week1, and 5% for week1→month3. Outlier genes at month3 include butyrate production pathway genes and flagellar genes (selected out). Donor samples score ~0% across all phases, which supports sustained recipient-driven selection

Experimental Evolution in Bioreactors¶

Question: How do microbial communities evolve during serial passage in controlled bioreactor conditions?

Design: Replicate bioreactors (e.g., n=4) under two conditions: high temperature (40°C) vs. standard (37°C). Samples taken every 50 generations for 200 generations = 5 timepoints × 2 conditions × 4 replicates = 40 samples.

Complete Configuration:

input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/bioreactor_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/bioreactor_evolution/gen0_gen200/

quality_control:
  breadth_threshold: 0.2           # Experimental design = uniform high coverage
  min_sample_num: 3                # 4 replicates per condition per timepoint; tolerates one dropout

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [gen0, gen200]
      focus: gen200                # Compare ancestral vs. final evolved state
  groups_combinations:
    - [high_temp, standard]
  use_lmm: true                    # Captures bioreactor-specific effects
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 16
    mem_mb: 16000
    time_min: 120

Expected command:

alleleflux run --config config_bioreactor.yml --threads 16

Intermediate timepoint analysis:

For resolution of evolutionary kinetics across generational transitions, create two additional configs:

# Early-to-mid evolution (first additional config file)
output:
  root_dir: output/bioreactor_evolution/gen0_gen100/
analysis:
  timepoints_combinations:
    - timepoint: [gen0, gen100]
      focus: gen100

# Mid-to-late evolution (second additional config file)
output:
  root_dir: output/bioreactor_evolution/gen100_gen200/
analysis:
  timepoints_combinations:
    - timepoint: [gen100, gen200]
      focus: gen200

Output files to examine:

Long-term: gen0_gen200/step2_output/scores/mag_level_scores.tsv - Overall 200-generation trajectory per condition
Early-mid: gen0_gen100/step2_output/scores/mag_level_scores.tsv - Initial adaptation phase
Mid-late: gen100_gen200/step2_output/scores/mag_level_scores.tsv - Secondary adaptation or plateau
All phases: step2_output/statistical_tests/cmh_results.tsv - Parallel evolution across bioreactor replicates
All phases: step2_output/outliers/outlier_genes.tsv - Adaptive genes, timestamped by phase
All phases: step2_output/evolution/dnds_results.tsv - dN/dS ratio for adaptive validation (if enabled)

Result interpretation:

Strong parallelism signature (long-term, gen0→gen200): Expect parallelism scores 10-20% in high_temp group (cleaner signal than in vivo due to controlled conditions). Standard group should stay <2%
CMH statistical strength: A replicated experimental design (n=4 bioreactors per condition) under controlled conditions can yield very strong CMH signals (p < 0.001), which supports reproducible evolution rather than sampling noise
Temporal kinetics (multi-phase view):
- Phase 1 (gen0→gen100): Weak to moderate signal (parallelism ~1-5%), stochastic drift→early favorable mutations sweep
- Phase 2 (gen100→gen200): Scores plateau or increase further (parallelism 10-15% additional), late-stage stabilization or secondary mutations
- Long-term (gen0→gen200): Combined signal showing cumulative evolution (10-20% total)
Gene appearance timeline: Parse outlier_genes.tsv by phase:
- Early outliers (gen0-gen100): First-hit beneficial genes (e.g., heat shock protein upregulators)
- Late outliers (gen100-gen200): Second-site compensatory mutations or fine-tuning
Predicted temperature adaptation genes: Heat shock proteins (GroEL, DnaK), membrane lipid remodeling, oxidative stress resistance (catalase, superoxide dismutase)
dN/dS validation: Outlier genes with dN/dS > 1 are consistent with positive selection. dN/dS well below 1 (e.g., 0-0.2) indicates purifying selection, so outliers in such genes may reflect hitchhiking with a linked selected site rather than direct selection on the gene
Example interpretation: A core metabolic MAG (e.g., Achromobacter) shows: Phase 1 (gen0→gen100) parallelism 2% with early outlier flagellar gene (likely thermotaxis); Phase 2 (gen100→gen200) parallelism 13% additional with outliers in GroEL and transporter gene. CMH p-value 1e-6 confirms all 4 high-temp bioreactors independently selected this gene set in both phases. Compare to standard condition (parallelism <1% in both phases), confirming temperature-driven selection. This pattern suggests: (1) initial thermotaxis response (phase 1), (2) proteostasis and nutrient transport optimization at elevated temperature (phase 2)

Configuration Strategy Guide¶

Choosing the right configuration for your study design is critical. This section summarizes the key decisions and tradeoffs.

Data Type: `single` vs. `longitudinal`¶

Aspect	`data_type: "single"`	`data_type: "longitudinal"`
Use when	One timepoint only; comparing cross-sectional groups	Multiple timepoints; tracking evolution over time
Sample design	Disease vs. healthy; treatment A vs. treatment B	Pre/during/post; day0 → day7 → day30
Statistical tests	Unpaired two-sample test, single-sample tests	Paired two-sample test, CMH, LMM
Power requirements	More samples needed (n=10-15 per group)	Fewer samples (n=4-8 per group) - paired design has higher power
Output structure	Flat: one score per MAG per group	Hierarchical: one score per MAG per timepoint per group
Example	Gut microbiota in IBD vs. control	Antibiotic resistance during treatment: pre/during/post

Decision rule: If you have multiple timepoints, use longitudinal for better statistical power and ability to detect temporal dynamics.

Statistical Tests: When to Enable Each¶

Test	Enable when	Key output	Notes
Unpaired two-sample (`use_significance_tests: true`)	Any design with 2+ groups	`p_value` per MAG	Always compute; foundational test
LMM (`use_lmm: true`)	Unbalanced designs, repeated measures, covariates	`lmm_p_value`, `effect_size`	Use for mouse/host/individual variation; essential for clinical studies
CMH (`use_cmh: true`)	High replicates (n≥4), stratified designs	`cmh_p_value`, `stratified_odds_ratio`	Detects consistent allele changes across replicates; excellent for experimental replicates

Decision rules:

Longitudinal + LMM: Clinical/in vivo studies (e.g., FMT, antibiotic treatment). LMM handles repeated measures per subject
Longitudinal + CMH: Experimental designs with multiple independent replicates (e.g., bioreactors, replicate mice)
All three: High-power designs (n≥10 samples per group, n≥4 temporal points) - maximize signal detection
Single + unpaired only: Low-budget studies, cross-sectional designs

Quality Control Thresholds: `breadth_threshold` and `min_sample_num`¶

Scenario	`breadth_threshold`	`min_sample_num`	Rationale
High-quality isolated culture	0.5-1.0	3	Uniform coverage; can be strict
Human gut microbiota (high biomass)	0.2-0.3	5-6	Deep sequencing + abundant organisms
Environmental samples	0.05-0.1	3-4	Sparse coverage; retain rare MAGs
Ultra-low biomass (lung, blood)	0.01-0.05	2-3	Contamination risk; threshold critical
Mixed/stressed microbiota	0.15	4	Intermediate; account for uneven sampling

Interpretation of thresholds:

breadth_threshold: Fraction of genomic positions covered by at least one read (≥1× depth). A threshold of 0.2 means that at least 20% of a MAG's genome must be covered in a sample
min_sample_num: Minimum sample count required for a MAG to pass QC per group-timepoint combination
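
To make the breadth definition concrete, the sketch below computes it from a per-position depth table and compares it with a threshold. It only illustrates the definition and is not how AlleleFlux computes breadth internally: the table is toy data in the three-column shape (contig, position, depth) that `samtools depth -a` produces, and treating a single contig as the whole MAG is a simplifying assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

depth=$(mktemp)
# Toy depth table for a 10-bp "MAG": contig, position, depth.
printf 'contig_1\t%s\t%s\n' 1 0  2 3  3 5  4 0  5 0  6 0  7 1  8 0  9 0  10 0 > "$depth"

# Breadth = positions with depth >= 1, divided by all positions.
breadth=$(awk -F'\t' '$3 >= 1 { covered++ } END { printf "%.2f", (NR ? covered / NR : 0) }' "$depth")

threshold=0.2
# awk handles the floating-point comparison that plain shell tests cannot.
if awk -v b="$breadth" -v t="$threshold" 'BEGIN { exit !(b >= t) }'; then
  status=pass
else
  status=fail
fi

echo "breadth=$breadth threshold=$threshold status=$status"
```

Three of the ten positions are covered, so the breadth is 0.30 and this sample passes a threshold of 0.2 but would fail one of 0.5.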

Tuning strategy:

Start conservative (breadth 0.2, min_sample 6) to get clean signal
If too few MAGs pass QC, relax to 0.15/5 or 0.1/4
Never go below 0.05 breadth (high false positive risk from mapping errors)
Never set min_sample_num above the number of replicates in your smallest group; no MAG could pass QC for that group
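
Before settling on min_sample_num, it can help to count how many samples each group-timepoint combination actually contains. The sketch below does this for a toy metadata table; the column names (sample_id, group, time) are assumptions, so adjust the field numbers to match your own metadata file.

```shell
#!/usr/bin/env bash
set -euo pipefail

meta=$(mktemp)
# Toy metadata with hypothetical columns: sample_id, group, time.
{
  printf 'sample_id\tgroup\ttime\n'
  for i in {1..10}; do printf 'A%02d_pre\tantibiotic\tpre\n' "$i"; done
  for i in {1..8};  do printf 'C%02d_pre\tcontrol\tpre\n' "$i"; done
  for i in {1..5};  do printf 'C%02d_during\tcontrol\tduring\n' "$i"; done
} > "$meta"

min_sample_num=6
# Count samples per (group, time) and flag combinations below the minimum.
report=$(awk -F'\t' -v min_n="$min_sample_num" '
  NR > 1 { n[$2 "\t" $3]++ }
  END {
    for (k in n) {
      printf "%s\t%d\t%s\n", k, n[k], (n[k] >= min_n ? "ok" : "too_few")
    }
  }' "$meta" | LC_ALL=C sort)

echo "$report"
```

Here the control group has only 5 samples at the during timepoint, so a min_sample_num of 6 could never be met for that combination.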

Resource Allocation Recommendations¶

resources:
  allele_freq:          # Profiling phase (step 1)
    cpus: 8             # Typical range 8-16; scales linearly with # MAGs and samples
    mem_mb: 8000        # Typical range 8000-16000; mostly BAM loading
    time_min: 60        # Typical range 60-120; depends on sequencing depth

  statistical_tests:    # Analysis phase (step 2)
    cpus: 12            # Typical range 12-32; parallelizes well across genes/MAGs
    mem_mb: 16000       # Typical range 16000-32000; keeps large matrices in memory
    time_min: 120       # Typical range 120-240; scales with # positions, # tests, # replicates

Scaling rules:

CPUs: Use --threads equal to available cores. AlleleFlux parallelizes MAG and gene processing
Memory:
- Minimum 8 GB (single MAG, small dataset)
- Standard 16 GB (typical microbiome study, 50-200 MAGs)
- High-demand 32+ GB (>500 MAGs, 1000+ samples, many-replicate CMH tests)
Time:
- Step 1 profiling: ~5-10 min per sample per core
- Step 2 analysis: ~1-10 min per MAG (depends on test complexity, # positions)
- CMH tests dominate runtime on high-replicate designs
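
As a rough worked example of the Step 1 rule of thumb, the sketch below estimates wall-clock time for the 54-sample antibiotic design on 16 cores. It assumes the figure means 5-10 core-minutes per sample and that work spreads evenly across cores; both are simplifications, so treat the result as a starting point for time_min rather than a guarantee.

```shell
#!/usr/bin/env bash
set -euo pipefail

samples=54   # antibiotic design: (10 + 8) mice x 3 timepoints
cores=16

# Ceiling division: ceil(samples * minutes_per_sample / cores).
est_low=$((  (samples * 5  + cores - 1) / cores ))
est_high=$(( (samples * 10 + cores - 1) / cores ))

echo "Step 1 estimate: ${est_low}-${est_high} min on ${cores} cores"
```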

Cluster submission (SLURM):

#!/bin/bash
# Typical microbiome study
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=240                # minutes

alleleflux run --config config.yml --threads 16

Matching Config to Study Characteristics¶

Large animal study (e.g., cattle microbiome, n=30 cattle, 3 timepoints):

data_type: longitudinal
breadth_threshold: 0.2
min_sample_num: 8
use_lmm: true         # Accounts for individual animal variation
use_cmh: true         # 30 animals = many replicates

Clinical trial (e.g., probiotic intervention, n=20 subjects, pre/post):

data_type: longitudinal
breadth_threshold: 0.15
min_sample_num: 5
use_lmm: true         # Essential: subject-level random effects
use_significance_tests: true
use_cmh: false        # Optional here; LMM better handles unbalanced pre/post sampling

Bioreactor experiment (n=4 replicates, 5 timepoints, controlled):

data_type: longitudinal
breadth_threshold: 0.2
min_sample_num: 3     # Must not exceed the 4 replicates per condition
use_lmm: true
use_cmh: true         # 4 replicates perfect for CMH

Environmental survey (e.g., soil sites, low coverage, n=10 sites):

data_type: single     # One-time sampling
breadth_threshold: 0.1
min_sample_num: 3     # Conservative with coverage
use_lmm: false
use_significance_tests: true

For complete worked examples, see Tutorial and Interpreting Results.
