# Use Cases

Real-world applications of AlleleFlux for metagenomic evolution studies.

## Antibiotic Resistance Evolution

**Question**: How do bacterial communities evolve under antibiotic treatment?

**Design**: Longitudinal fecal samples from antibiotic-treated vs. control mice (pre, during, post-treatment). 10 treated, 8 control mice with 3 samples each = 54 total samples. To analyze with AlleleFlux, we compare two timepoints at a time using separate configurations: pre→during and during→post.

**Complete Configuration (pre→during comparison)**:

```yaml
input:
   bam_dir: data/bam/
   fasta: data/reference/combined_mags.fa
   metadata: data/metadata/sample_metadata.tsv
   prodigal_path: data/prodigal/combined_genes.gff
   gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
   mag_mapping: data/mapping/mag_contig_mapping.tsv

 output:
   root_dir: output/antibiotic_resistance/pre_during/

 quality_control:
   breadth_threshold: 0.2           # Require 20% genome coverage
   min_sample_num: 6                # At least 6 samples per group-timepoint

 analysis:
   data_type: longitudinal
   timepoints_combinations:
     - timepoint: [pre, during]       # Compare pre → during treatment
       focus: during                  # Focus on during-treatment state
   groups_combinations:
     - [antibiotic, control]
   use_lmm: true
   use_significance_tests: true
   use_cmh: true

 statistics:
   p_value_threshold: 0.05
   fdr_method: fdr_bh
   fdr_threshold: 0.1

 resources:
   allele_freq:
     cpus: 8
     mem_mb: 8000
     time_min: 60
   statistical_tests:
     cpus: 16
     mem_mb: 16000
     time_min: 120
```

**Additional Configuration (during→post comparison)**:

For the post-treatment phase, create a second config file with:
```yaml
timepoints_combinations:
  - timepoint: [during, post]      # Compare during → post treatment
    focus: post                     # Focus on post-treatment state
```

**Expected command**:

```bash
# Phase 1: Early response to antibiotic
alleleflux run --config config_antibiotic_pre_during.yml --threads 16

# Phase 2: Late response and stabilization
alleleflux run --config config_antibiotic_during_post.yml --threads 16
```

**Output files to examine**:
- `pre_during/step1_output/eligibility_tables/eligibility_table_pre_during-antibiotic_control.tsv` - Which MAGs pass QC
- `pre_during/step2_output/scores/mag_level_scores.tsv` - Parallelism scores during initial treatment response
- `during_post/step2_output/scores/mag_level_scores.tsv` - Parallelism scores during stabilization phase
- Combined gene-level scores identifying resistance candidates across both phases
- Combined statistical results controlling for mouse identity

**Result interpretation**:
1. **Early response phase (pre→during)**: MAGs with parallelism scores >2-3% show rapid adaptation to antibiotic stress. Expect lower scores than late phase
2. **Late stabilization phase (during→post)**: Scores may increase further if selection continues, or plateau if community has stabilized
3. **Divergence between phases**: Compare scores to identify which genes are selected early (rapid response) vs. late (fine-tuning). Genes present in both phases indicate core resistance mechanisms
4. **Cross-group comparison**: Compare antibiotic group evolution to control group (expect <2% parallelism in controls for both phases)
5. **Gene-level outliers**: Look for genes in CARD database (antibiotic resistance genes). Unknown genes are candidates for novel resistance mechanisms
6. **CMH significance**: p-values < 0.05 indicate consistent allele changes across individual mice, ruling out within-mouse noise
7. **Example interpretation**: If *Bacteroides* MAG shows parallelism score 3% in pre→during (treated group) with early outliers including `tetR`, and then 6% in during→post with additional outliers in a porin gene, this indicates: (1) early selection for direct resistance, (2) late selection for drug efflux optimization

## Diet-Microbiome Adaptation

**Question**: How do microbiomes adapt to dietary changes?

**Design**: Two diet groups (high-fat vs. standard diet), 12 mice per group, sampled at baseline, week 1, and week 4 = 72 total samples

**Complete Configuration**:

```yaml
input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/diet_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/diet_adaptation/baseline_week1/

quality_control:
  breadth_threshold: 0.1           # Lower threshold (may have diet-dependent coverage)
  min_sample_num: 4                # 12 mice per group, require 4 across samples

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [baseline, week1]
      focus: week1                 # Focus on initial dietary response
  groups_combinations:
    - [highfat, standard]
  use_lmm: true
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 12
    mem_mb: 12000
    time_min: 90
  statistical_tests:
    cpus: 12
    mem_mb: 12000
    time_min: 90
```

**Expected command**:

```bash
alleleflux run --config config_diet.yml --threads 12
```

**Additional Configuration (week1→week4 comparison)**:

For the longer-term adaptation trajectory, create a second config file with:
```yaml
output:
  root_dir: output/diet_adaptation/week1_week4/

timepoints_combinations:
  - timepoint: [week1, week4]      # Compare week1 → week4 adaptation
    focus: week4                     # Focus on steady-state adaptations
```

**Output files to examine**:
- Phase 1: `baseline_week1/step1_output/eligibility_tables/eligibility_table_baseline_week1-highfat_standard.tsv`
- Phase 2: `week1_week4/step1_output/eligibility_tables/eligibility_table_week1_week4-highfat_standard.tsv`
- Phase 1 scores: `baseline_week1/step2_output/scores/mag_level_scores.tsv` - Early dietary response
- Phase 2 scores: `week1_week4/step2_output/scores/mag_level_scores.tsv` - Stabilization phase
- `step2_output/scores/gene_scores.tsv` - Metabolic genes under selection
- `step2_output/scores/taxa_scores.tsv` - Phylum/Family level aggregation for dietary responders
- `step2_output/outliers/outlier_genes.tsv` - Genes under selection, mapped to KEGG pathways

**Result interpretation**:
1. **Two-phase adaptation**: Phase 1 (baseline→week1) shows rapid initial response; Phase 2 (week1→week4) shows slower stabilization
2. **Metabolic focus**: Filter `gene_scores.tsv` for genes annotated with: carbohydrate transport, lipid metabolism, short-chain fatty acid synthesis
3. **Taxa-level patterns**: In high-fat group, expect high scores for Bacteroidetes (fiber fermenters) and Faecalibacterium (SCFA producers). Standard diet shows stable low scores in both phases
4. **Phase comparison**: Scores may increase, plateau, or decrease from phase 1 to phase 2 depending on adaptation kinetics
5. **Example interpretation**: A Roseburia MAG may show parallelism 3% in phase 1 (baseline→week1) with early outliers in carbohydrate transport, then 5% in phase 2 (week1→week4) with additional outliers in propionate synthesis genes. This indicates initial colonization of gut with subsequent metabolic fine-tuning

## Host-Microbe Co-evolution

**Question**: How do host genotypes shape microbial evolution?

**Design**: Longitudinal samples from WT vs. knockout mice, 10 mice per group, 3 timepoints = 60 total samples

**Complete Configuration**:

```yaml
input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/host_genotype_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/host_microbe/week0_week4/

quality_control:
  breadth_threshold: 0.15
  min_sample_num: 5                # 10 mice per group

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [week0, week4]
      focus: week4                 # Compare early → mid-stage genotype effects
  groups_combinations:
    - [wildtype, knockout]
  use_lmm: true                    # Critical: captures mouse-level random effects
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 16
    mem_mb: 16000
    time_min: 120
```

**Expected command**:

```bash
alleleflux run --config config_host.yml --threads 16
```

**Additional Configuration (week4→week8 comparison)**:

For the longer-term co-evolution analysis, create a second config file with:
```yaml
output:
  root_dir: output/host_microbe/week4_week8/

timepoints_combinations:
  - timepoint: [week4, week8]
    focus: week8                   # Compare mid-stage → late-stage genotype effects
```

**Output files to examine**:
- Phase 1: `week0_week4/step2_output/scores/mag_level_scores.tsv` - Early genotype-dependent selection
- Phase 2: `week4_week8/step2_output/scores/mag_level_scores.tsv` - Late genotype-dependent selection
- Phase 1 LMM: `week0_week4/step2_output/statistical_tests/lmm_results.tsv` - Test genotype effect early
- Phase 2 LMM: `week4_week8/step2_output/statistical_tests/lmm_results.tsv` - Test genotype effect late
- `step2_output/outliers/outlier_genes.tsv` - Genotype-dependent adaptive genes
- `step2_output/scores/gene_scores.tsv` - Filtered for surface proteins, secretion systems

**Result interpretation**:
1. **Genotype-specific MAGs**: MAGs with high scores in KO but low/zero in WT indicate host-genotype-dependent selection
2. **LMM significance**: p-values < 0.05 in LMM show genotype effect while controlling for individual mouse variation
3. **Gene annotation**: Focus outliers on: outer membrane proteins, secretion systems (T6SS, Sec), immune-related factors
4. **Convergent evolution**: If multiple MAGs show similar outliers (e.g., same flagellar genes), this suggests host-driven convergence
5. **Temporal dynamics**: Compare phase 1 (week0→week4) vs. phase 2 (week4→week8) to reveal whether selection is rapid early or gradual throughout
6. **Example interpretation**: If Bacteroides MAG shows parallelism 0% in WT but 4% in IL-10KO during phase 1, then 0% in WT but 7% in IL-10KO during phase 2, with outliers in flagellar genes and mucus-degrading glycosidases in both phases, this indicates sustained and strengthening IL-10-dependent selection for motile mucus-colonizers

## Environmental Adaptation

**Question**: How do communities adapt to pollution?

**Design**: Contaminated vs. pristine soil sites, sampled at month 0, 3, 6. Multiple sites per treatment = 30+ samples

**Complete Configuration**:

```yaml
input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/environmental_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/environmental/month0_month3/

quality_control:
  breadth_threshold: 0.1           # Environmental samples often have lower coverage
  min_sample_num: 3                # 4-5 replicates per group-timepoint

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [month0, month3]
      focus: month3
  groups_combinations:
    - [contaminated, pristine]
  use_lmm: false                   # Site effects complex; pair-wise comparisons instead
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 12
    mem_mb: 12000
    time_min: 90
```

**Expected command**:

```bash
alleleflux run --config config_env.yml --threads 8
```

**Additional Configuration (month3→month6 comparison)**:

For the longer-term environmental adaptation analysis, create a second config file with:
```yaml
output:
  root_dir: output/environmental/month3_month6/

timepoints_combinations:
  - timepoint: [month3, month6]
    focus: month6                  # Track sustained adaptation to contamination
```

**Output files to examine**:
- Phase 1: `month0_month3/step2_output/scores/mag_level_scores.tsv` - Initial contamination response
- Phase 2: `month3_month6/step2_output/scores/mag_level_scores.tsv` - Sustained adaptation or stabilization
- Phase 1 CMH: `month0_month3/step2_output/statistical_tests/cmh_results.tsv` - Parallel response across sites
- Phase 2 CMH: `month3_month6/step2_output/statistical_tests/cmh_results.tsv` - Sustained parallel effects
- `step2_output/outliers/outlier_genes.tsv` - Pollutant-specific genes (e.g., heavy metal resistance, hydrocarbon degradation)
- `step2_output/scores/gene_scores.tsv` - Pathway annotation for functional interpretation

**Result interpretation**:
1. **Biphasic adaptation**: Phase 1 (month0→month3) shows rapid initial response; Phase 2 (month3→month6) tracks sustained or enhanced adaptation
2. **CMH significance**: p < 0.05 confirms parallel evolution across contaminated replicates in each phase, not chance
3. **Temporal progression**: Contaminated site scores may increase across both phases; pristine site scores remain low and flat in both
4. **Gene-level biomarkers**: Outliers encoding heavy metal efflux (CzcA, CopA), hydrocarbon degradation (alkane hydroxylase, cytochrome P450), or xenobiotic pathways are credible pollutant-responsive genes
5. **Functional categories**: Use KEGG pathway annotation to identify complete degradation operons under selection
6. **Example interpretation**: Arthrobacter MAG shows parallelism 5% at contaminated site during phase 1 (month0→month3), with early outliers for mercury resistance (merA). During phase 2 (month3→month6), parallelism increases to 9.5%, with additional outliers for arsenic efflux (arsB) and PCB degradation. This indicates multi-stage adaptation: initial mercury response followed by broader multi-contaminant tolerance. Genes absent from pristine site samples in both phases confirm pollution-driven selection

## Fecal Microbiota Transplant (FMT) Study

**Question**: How does the donor microbiome adapt and stabilize after transplant into recipients?

**Design**: 15 FMT recipients sampled pre-FMT, day 1, week 1, month 1, month 3 post-FMT. Include 5 donor samples for baseline = 80 total samples. Track whether donor-derived taxa establish and whether they evolve to match recipient genetics.

**Complete Configuration**:

```yaml
input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/fmt_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/fmt_adaptation/pre_fmt_month3_post/

quality_control:
  breadth_threshold: 0.15          # Clinical samples vary; balance coverage vs. MAG count
  min_sample_num: 5                # ~15 recipients give ample replicates

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre_fmt, month3_post]
      focus: month3_post           # Final steady state
  groups_combinations:
    - [recipient, donor]           # Compare recipient+donor evolution
  use_lmm: true                    # Critical: each recipient is unique host environment
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 12
    mem_mb: 16000
    time_min: 120
  statistical_tests:
    cpus: 16
    mem_mb: 20000
    time_min: 180
```

**Expected command**:

```bash
alleleflux run --config config_fmt.yml --threads 16
```

**Fine-grained temporal tracking**:

For detailed temporal resolution of early establishment and late adaptation phases, create additional config files:
```yaml
# Early phase: initial colonization shock
output:
  root_dir: output/fmt_adaptation/pre_fmt_day1_post/
timepoints_combinations:
  - timepoint: [pre_fmt, day1_post]
    focus: day1_post

# Mid phase: establishment
output:
  root_dir: output/fmt_adaptation/day1_post_week1_post/
timepoints_combinations:
  - timepoint: [day1_post, week1_post]
    focus: week1_post

# Late phase: stabilization
output:
  root_dir: output/fmt_adaptation/week1_post_month3_post/
timepoints_combinations:
  - timepoint: [week1_post, month3_post]
    focus: month3_post
```

**Output files to examine**:
- Long-term: `pre_fmt_month3_post/step2_output/scores/mag_level_scores.tsv` - Overall recipient vs donor trajectory
- Early phase: `pre_fmt_day1_post/step2_output/scores/mag_level_scores.tsv` - Colonization shock signature
- Establishment: `day1_post_week1_post/step2_output/scores/mag_level_scores.tsv` - Early stabilization
- Late phase: `week1_post_month3_post/step2_output/scores/mag_level_scores.tsv` - Final adaptation
- All phases: `step2_output/statistical_tests/lmm_results.tsv` - Recipient-level variation
- All phases: `step2_output/outliers/outlier_genes.tsv` - Recipient-specific adaptations (phase-dependent)

**Result interpretation**:
1. **Bifurcation in scores**: Long-term analysis (pre_fmt→month3_post) should show donor group scores stable ~0-2% across all timepoints; recipient group scores increase from ~1% (pre-FMT) to 3-8% by month3_post
2. **Temporal kinetics across phases**:
   - **Early phase (pre→day1)**: Colonization shock; large allele frequency swings as stressed donor cells meet new environment; weak but detectable parallelism (~1-2%)
   - **Establishment phase (day1→week1)**: Intermediate dynamics; scores increase toward stabilization plateau
   - **Late phase (week1→month3)**: Plateau phase; scores stabilize as donor microbiota equilibrates; CMH p-values strengthen (p < 0.01) indicating coordinated adaptation across recipients
3. **Gene-level biomarkers vary by phase**:
   - **Early (day1_post)**: High allele frequency variance; genes involved in stress response, adhesion initiation
   - **Late (month3_post)**: Stabilized genes in colonization (adhesins, mucin-binding), nutrient scavenging (vitamin synthesis, carbohydrate transport), immune evasion (flagellar reduction, LPS modification)
4. **Cross-recipient parallelism**: High CMH significance in late phase indicates similar donor lineages undergo similar selective pressures in different recipients, revealing universal recipient-environment constraints
5. **Example interpretation**: Donor-derived Faecalibacterium MAG shows parallelism 0.5% (pre-FMT, long-term view) or complex signal in early phases (pre→day1 shows 1%, day1→week1 shows 2%, week1→month3 shows 5% incremental). Outlier genes at month3 include butyrate production pathway genes and flagellar genes (selected out). Compare to donor samples (score ~0% across all phases), confirming sustained recipient-driven selection

## Experimental Evolution in Bioreactors

**Question**: How do microbial communities evolve during serial passage in controlled bioreactor conditions?

**Design**: Replicate bioreactors (e.g., n=4) under two conditions: high temperature (40°C) vs. standard (37°C). Samples taken every 50 generations for 200 generations = 5 timepoints × 2 conditions × 4 replicates = 40 samples.

**Complete Configuration**:

```yaml
input:
  bam_dir: data/bam/
  fasta: data/reference/combined_mags.fa
  metadata: data/metadata/bioreactor_metadata.tsv
  prodigal_path: data/prodigal/combined_genes.gff
  gtdb_taxonomy: data/taxonomy/gtdb_taxonomy.tsv
  mag_mapping: data/mapping/mag_contig_mapping.tsv

output:
  root_dir: output/bioreactor_evolution/gen0_gen200/

quality_control:
  breadth_threshold: 0.2           # Experimental design = uniform high coverage
  min_sample_num: 6                # 4 replicates per condition per timepoint

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [gen0, gen200]
      focus: gen200                # Compare ancestral vs. final evolved state
  groups_combinations:
    - [high_temp, standard]
  use_lmm: true                    # Captures bioreactor-specific effects
  use_significance_tests: true
  use_cmh: true

statistics:
  p_value_threshold: 0.05
  fdr_method: fdr_bh
  fdr_threshold: 0.1

resources:
  allele_freq:
    cpus: 8
    mem_mb: 8000
    time_min: 60
  statistical_tests:
    cpus: 16
    mem_mb: 16000
    time_min: 120
```

**Expected command**:

```bash
alleleflux run --config config_bioreactor.yml --threads 16
```

**Intermediate timepoint analysis**:

For resolution of evolutionary kinetics across generational transitions, create two additional configs:
```yaml
# Early-to-mid evolution
output:
  root_dir: output/bioreactor_evolution/gen0_gen100/
timepoints_combinations:
  - timepoint: [gen0, gen100]
    focus: gen100

# Mid-to-late evolution
output:
  root_dir: output/bioreactor_evolution/gen100_gen200/
timepoints_combinations:
  - timepoint: [gen100, gen200]
    focus: gen200
```

**Output files to examine**:
- Long-term: `gen0_gen200/step2_output/scores/mag_level_scores.tsv` - Overall 200-generation trajectory per condition
- Early-mid: `gen0_gen100/step2_output/scores/mag_level_scores.tsv` - Initial adaptation phase
- Mid-late: `gen100_gen200/step2_output/scores/mag_level_scores.tsv` - Secondary adaptation or plateau
- All phases: `step2_output/statistical_tests/cmh_results.tsv` - Parallel evolution across bioreactor replicates
- All phases: `step2_output/outliers/outlier_genes.tsv` - Adaptive genes, timestamped by phase
- All phases: `step2_output/evolution/dnds_results.tsv` - dN/dS ratio for adaptive validation (if enabled)

**Result interpretation**:
1. **Strong parallelism signature** (long-term, gen0→gen200): Expect parallelism scores 10-20% in high_temp group (cleaner signal than in vivo due to controlled conditions). Standard group should stay <2%
2. **CMH statistical strength**: High-replicate experimental design (n=4 bioreactors) yields very strong CMH signals (p < 0.001), confirming reproducible evolution, not sampling noise
3. **Temporal kinetics** (multi-phase view):
   - **Phase 1 (gen0→gen100)**: Weak to moderate signal (parallelism ~1-5%), stochastic drift→early favorable mutations sweep
   - **Phase 2 (gen100→gen200)**: Scores plateau or increase further (parallelism 10-15% additional), late-stage stabilization or secondary mutations
   - **Long-term (gen0→gen200)**: Combined signal showing cumulative evolution (10-20% total)
4. **Gene appearance timeline**: Parse `outlier_genes.tsv` by phase:
   - **Early outliers (gen0-gen100)**: First-hit beneficial genes (e.g., heat shock protein upregulators)
   - **Late outliers (gen100-gen200)**: Second-site compensatory mutations or fine-tuning
5. **Predicted temperature adaptation genes**: Heat shock proteins (GroEL, DnaK), membrane lipid remodeling, oxidative stress resistance (catalase, superoxide dismutase)
6. **dN/dS validation**: If outlier genes show dN/dS > 1, confirms positive selection. dN/dS near 0-0.2 may indicate relaxed purifying selection (hitchhiking) or structural evolution
7. **Example interpretation**: A core metabolic MAG (e.g., Achromobacter) shows: Phase 1 (gen0→gen100) parallelism 2% with early outlier flagellar gene (likely thermotaxis); Phase 2 (gen100→gen200) parallelism 13% additional with outliers in GroEL and transporter gene. CMH p-value 1e-6 confirms all 4 high-temp bioreactors independently selected this gene set in both phases. Compare to standard condition (parallelism <1% in both phases), confirming temperature-driven selection. This pattern suggests: (1) initial thermotaxis response (phase 1), (2) proteostasis and nutrient transport optimization at elevated temperature (phase 2)

## Configuration Strategy Guide

Choosing the right configuration for your study design is critical. This section summarizes the key decisions and tradeoffs.

### Data Type: `single` vs. `longitudinal`

| Aspect | `data_type: "single"` | `data_type: "longitudinal"` |
|--------|----------------------|---------------------------|
| **Use when** | One timepoint only; comparing cross-sectional groups | Multiple timepoints; tracking evolution over time |
| **Sample design** | Disease vs. healthy; treatment A vs. treatment B | Pre/during/post; day0 → day7 → day30 |
| **Statistical tests** | Unpaired two-sample test, single-sample tests | Paired two-sample test, CMH, LMM |
| **Power requirements** | More samples needed (n=10-15 per group) | Fewer samples (n=4-8 per group) - paired design has higher power |
| **Output structure** | Flat: one score per MAG per group | Hierarchical: one score per MAG per timepoint per group |
| **Example** | Gut microbiota in IBD vs. control | Antibiotic resistance during treatment: pre/during/post |

**Decision rule**: If you have multiple timepoints, use `longitudinal` for better statistical power and ability to detect temporal dynamics.

### Statistical Tests: When to Enable Each

| Test | Enable when | Key output | Notes |
|------|-------------|-----------|-------|
| **Unpaired two-sample** (`use_significance_tests: true`) | Any design with 2+ groups | `p_value` per MAG | Always compute; foundational test |
| **LMM** (`use_lmm: true`) | Unbalanced designs, repeated measures, covariates | `lmm_p_value`, `effect_size` | Use for mouse/host/individual variation; essential for clinical studies |
| **CMH** (`use_cmh: true`) | High replicates (n≥4), stratified designs | `cmh_p_value`, `stratified_odds_ratio` | Detects consistent allele changes across replicates; excellent for experimental replicates |

**Decision rules**:
- **Longitudinal + LMM**: Clinical/in vivo studies (e.g., FMT, antibiotic treatment). LMM handles repeated measures per subject
- **Longitudinal + CMH**: Experimental designs with multiple independent replicates (e.g., bioreactors, replicate mice)
- **All three**: High-power designs (n≥10 samples per group, n≥4 temporal points) - maximize signal detection
- **Single + unpaired only**: Low-budget studies, cross-sectional designs

### Quality Control Thresholds: `breadth_threshold` and `min_sample_num`

| Scenario | `breadth_threshold` | `min_sample_num` | Rationale |
|----------|------------------|-----------------|-----------|
| **High-quality isolated culture** | 0.5-1.0 | 3 | Uniform coverage; can be strict |
| **Human gut microbiota (high biomass)** | 0.2-0.3 | 5-6 | Deep sequencing + abundant organisms |
| **Environmental samples** | 0.05-0.1 | 3-4 | Sparse coverage; retain rare MAGs |
| **Ultra-low biomass (lung, blood)** | 0.01-0.05 | 2-3 | Contamination risk; threshold critical |
| **Mixed/stressed microbiota** | 0.15 | 4 | Intermediate; account for uneven sampling |

**Interpretation of thresholds**:
- **`breadth_threshold`**: Fraction of genomic positions with ≥1 coverage. A threshold of 0.2 means "MAG must be present at 20%+ of its genome in a sample"
- **`min_sample_num`**: Minimum sample count required for a MAG to pass QC per group-timepoint combination

**Tuning strategy**:
1. Start conservative (breadth 0.2, min_sample 6) to get clean signal
2. If too few MAGs pass QC, relax to 0.15/5 or 0.1/4
3. Never go below 0.05 breadth (high false positive risk from mapping errors)
4. Never set min_sample_num below the number of replicates in your smallest group

### Resource Allocation Recommendations

```yaml
resources:
  allele_freq:          # Profiling phase (step 1)
    cpus: 8-16          # Scales linearly with # MAGs and samples
    mem_mb: 8000-16000  # Mostly BAM loading
    time_min: 60-120    # Depends on sequencing depth

  statistical_tests:    # Analysis phase (step 2)
    cpus: 12-32         # Parallelizes well across genes/MAGs
    mem_mb: 16000-32000 # Keeps large matrices in memory
    time_min: 120-240   # Scales with # positions, # tests, # replicates
```

**Scaling rules**:
- **CPUs**: Use `--threads` equal to available cores. AlleleFlux parallelizes MAG and gene processing
- **Memory**: 
  - Minimum 8 GB (single MAG, small dataset)
  - Standard 16 GB (typical microbiome study, 50-200 MAGs)
  - High-demand 32+ GB (>500 MAGs, 1000+ samples, many-replicate CMH tests)
- **Time**:
  - Step 1 profiling: ~5-10 min per sample per core
  - Step 2 analysis: ~1-10 min per MAG (depends on test complexity, # positions)
  - CMH tests dominate runtime on high-replicate designs

**Cluster submission (SLURM)**:
```bash
# Typical microbiome study
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=240

alleleflux run --config config.yml --threads 16
```

### Matching Config to Study Characteristics

**Large animal study** (e.g., cattle microbiome, n=30 cattle, 3 timepoints):
```yaml
data_type: longitudinal
breadth_threshold: 0.2
min_sample_num: 8
use_lmm: true         # Accounts for individual animal variation
use_cmh: true         # 30 animals = many replicates
```

**Clinical trial** (e.g., probiotic intervention, n=20 subjects, pre/post):
```yaml
data_type: longitudinal
breadth_threshold: 0.15
min_sample_num: 5
use_lmm: true         # Essential: subject-level random effects
use_significance_tests: true
use_cmh: false        # Only 20 subjects; LMM better handles imbalance
```

**Bioreactor experiment** (n=4 replicates, 5 timepoints, controlled):
```yaml
data_type: longitudinal
breadth_threshold: 0.2
min_sample_num: 6     # Exceed replicate count for robustness
use_lmm: true
use_cmh: true         # 4 replicates perfect for CMH
```

**Environmental survey** (e.g., soil sites, low coverage, n=10 sites):
```yaml
data_type: single     # One-time sampling
breadth_threshold: 0.1
min_sample_num: 3     # Conservative with coverage
use_lmm: false
use_significance_tests: true
```

---

For complete worked examples, see [Tutorial](tutorial.md) and [Interpreting Results](../usage/interpreting_results.md).