Interpreting Results

This guide explains how to interpret the results produced by AlleleFlux.

Output Structure

AlleleFlux organizes results by analysis type:

output/
├── profiles/               # Per-sample allele counts
├── metadata/               # Per-MAG sample metadata
├── QC/                     # Quality control results
├── eligibility_table_*.tsv # MAG eligibility for tests
├── allele_analysis/        # Allele frequency analysis
├── significance_tests/     # Statistical test results
│   ├── two_sample_unpaired/
│   ├── two_sample_paired/
│   ├── single_sample/
│   ├── lmm/
│   └── cmh/
├── scores/
│   ├── intermediate/        # Per-MAG scores
│   └── processed/
│       ├── combined/        # Aggregated MAG/taxa scores
│       └── gene_scores/     # Gene-level scores
└── outliers/               # High-scoring outlier genes

Quick Interpretation Guide

After running AlleleFlux, follow this decision tree to interpret your results:

  1. Check eligibility → How many MAGs passed QC?

  2. Compare MAG scores → Which MAGs show the strongest selection?

  3. Examine gene scores → Which genes drive the MAG-level signal?

  4. Check outliers → Are there genes under exceptionally strong selection?

  5. Validate with dN/dS → Do outlier genes show positive selection (dN/dS > 1)?

  6. Cross-reference tests → Are results consistent across statistical methods?

Step-by-Step Checklist

Here’s a practical checklist to work through after analysis:

  • [ ] Step 1: Check Eligibility - Open eligibility_table_*.tsv and count how many MAGs passed QC filtering (have “true” in relevant test columns). If <10 MAGs eligible, lower QC thresholds or check coverage.

  • [ ] Step 2: Compare MAG Scores - Load scores from scores/processed/combined/MAG/ and identify top 10-20 MAGs by parallelism/divergence score. These are your primary candidates.

  • [ ] Step 3: Examine Gene Scores - For top MAGs, inspect scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv to see which genes contribute most to the MAG signal.

  • [ ] Step 4: Check Outliers - Review outliers/{mag}_{test}_outlier_genes.tsv files. Genes with multiple significant p-values (binomial, Poisson) are robust candidates.

  • [ ] Step 5: Validate with dN/dS - If available, run alleleflux-dnds --timepoint_data results/ on outlier genes. High scores with dN/dS > 1 indicate positive selection.

  • [ ] Step 6: Cross-Reference Tests - Compare results across statistical methods (two_sample_paired, LMM, CMH). Genes significant in 2+ methods are more robust.

Key Files

1. Eligibility Table

eligibility_table_{timepoints}-{groups}.tsv - Which MAGs qualify for each test based on coverage/samples

2. Statistical Tests (in significance_tests/)

Per-MAG files with p-values and test statistics: - {mag}_two_sample_unpaired.tsv.gz - Unpaired group comparisons - {mag}_lmm.tsv.gz - Linear mixed models - {mag}_cmh.tsv.gz - Cochran-Mantel-Haenszel tests

Key columns: contig, position, gene_id, p_value_{test}, q_value_{test}

3. Scores (in scores/processed/combined/)

  • scores_{test}-{tp}-{gr}-MAGs.tsv - MAG-level parallelism/divergence scores

  • scores_{test}-{tp}-{gr}-{taxon}.tsv - Taxonomic aggregations (phylum to species)

4. Gene Scores (in scores/processed/gene_scores/)

  • {mag}_{test}_gene_scores_individual.tsv - Per-gene scores

  • {mag}_{test}_outlier_genes.tsv - High-scoring genes under selection

Score Interpretation

Parallelism Score (0-100%) Measures consistent allele changes across replicates within a group. High scores → deterministic evolution (not random drift).

Divergence Score (0-100%) Quantifies allele frequency differences between groups. High scores → differential selection between conditions.

CMH Test Detects parallel allele changes across timepoints while controlling for individual variation. Particularly powerful for longitudinal studies.

File Format Details

Profile files (profiles/{sample}_{mag}_profiled.tsv.gz): contig, position, ref_base, total_coverage, A, C, G, T, gene_id

Statistical test results (significance_tests/{test}/{mag}_{test}.tsv.gz): contig, position, gene_id, p_value_{test}, q_value_{test}

Gene scores (scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv): gene_id, total_sites, significant_sites, score_%

Outliers (outliers/{mag}_{test}_outlier_genes.tsv): gene_id, gene_score_%, mag_score_%, p_value_binomial, p_value_poisson

Loading Results in Python

Load and Display Top 10 MAGs by Parallelism Score

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load MAG-level scores
scores = pd.read_csv(
    "output/longitudinal/scores/processed/combined/MAG/"
    "scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv",
    sep="\t"
)

# Top 10 MAGs by parallelism score
top_mags = scores.nlargest(10, "score_two_sample_paired_tTest (%)")
print("Top 10 MAGs by Parallelism Score:")
print(top_mags[["MAG_ID", "score_two_sample_paired_tTest (%)", 
                "total_sites_per_group_two_sample_paired_tTest", "phylum", "family"]])

Plot Score Distribution

# Visualize score distribution across all MAGs
fig, ax = plt.subplots(figsize=(10, 6))
scores["score_two_sample_paired_tTest (%)"].hist(bins=50, ax=ax, edgecolor="black")
ax.set_xlabel("Parallelism Score (%)")
ax.set_ylabel("Number of MAGs")
ax.set_title("Distribution of Parallelism Scores Across MAGs")
ax.axvline(scores["score_two_sample_paired_tTest (%)"].median(), 
           color="red", linestyle="--", label="Median")
plt.legend()
plt.tight_layout()
plt.savefig("score_distribution.png", dpi=150)
plt.show()

# Summary statistics
print(f"Mean parallelism score: {scores['score_two_sample_paired_tTest (%)'].mean():.2f}%")
print(f"Median parallelism score: {scores['score_two_sample_paired_tTest (%)'].median():.2f}%")
print(f"95th percentile: {scores['score_two_sample_paired_tTest (%)'].quantile(0.95):.2f}%")

Compare Scores Across Different Tests

# Load scores from different statistical tests
output_base = "output/longitudinal/scores/processed/combined/MAG/"
paired = pd.read_csv(
    f"{output_base}scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv", 
    sep="\t"
)
lmm = pd.read_csv(
    f"{output_base}scores_lmm-pre_post-treatment_control-MAGs.tsv", 
    sep="\t"
)
cmh = pd.read_csv(
    f"{output_base}scores_cmh-pre_post-treatment_control-MAGs.tsv", 
    sep="\t"
)

# Merge on MAG_ID
merged = paired[["MAG_ID", "score_two_sample_paired_tTest (%)"]].merge(
    lmm[["MAG_ID", "score_lmm (%)"]],
    on="MAG_ID",
    how="outer"
).merge(
    cmh[["MAG_ID", "score_CMH (%)"]],
    on="MAG_ID",
    how="outer"
)

# Fill NaN for MAGs missing from certain tests
merged.fillna(0, inplace=True)

# Correlation between test results
print("Correlation between statistical tests:")
print(merged[["score_two_sample_paired_tTest (%)", "score_lmm (%)", "score_CMH (%)"]].corr())

# Visualize consistency
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(merged["score_two_sample_paired_tTest (%)"], merged["score_lmm (%)"], alpha=0.6)
axes[0].set_xlabel("Two-Sample Paired (%)"); axes[0].set_ylabel("LMM (%)")
axes[1].scatter(merged["score_two_sample_paired_tTest (%)"], merged["score_CMH (%)"], alpha=0.6)
axes[1].set_xlabel("Two-Sample Paired (%)"); axes[1].set_ylabel("CMH (%)")
axes[2].scatter(merged["score_lmm (%)"], merged["score_CMH (%)"], alpha=0.6)
axes[2].set_xlabel("LMM (%)"); axes[2].set_ylabel("CMH (%)")
plt.tight_layout()
plt.savefig("test_consistency.png", dpi=150)
plt.show()

# Identify robust MAGs (high scores in multiple tests)
threshold = 50  # 50th percentile
robust_mags = merged[
    (merged["score_two_sample_paired_tTest (%)"] > threshold) |
    (merged["score_lmm (%)"] > threshold) |
    (merged["score_CMH (%)"] > threshold)
]
print(f"Found {len(robust_mags)} MAGs with score > {threshold}% in at least one test")

Load and Investigate Outlier Genes for a Specific MAG

# Load outlier genes for a specific MAG
import gzip

mag_id = "Bacteroides_001"
outlier_file = (
    f"output/longitudinal/outliers/pre_post-treatment_control/"
    f"{mag_id}_two_sample_paired_outlier_genes.tsv.gz"
)

# Handle gzipped files
if outlier_file.endswith('.gz'):
    outliers = pd.read_csv(outlier_file, sep="\t", compression='gzip')
else:
    outliers = pd.read_csv(outlier_file, sep="\t")

print(f"Found {len(outliers)} outlier genes for {mag_id}")

# Display top outliers sorted by gene score
top_outliers = outliers.nlargest(20, "gene_score_%")[
    ["gene_id", "gene_score_%", "mag_score_%", "p_value_binomial", "p_value_poisson"]
]
print("\nTop 20 Outlier Genes:")
print(top_outliers)

# Filter by significance (both p-values < 0.05)
significant = outliers[
    (outliers["p_value_binomial"] < 0.05) & 
    (outliers["p_value_poisson"] < 0.05)
]
print(f"\nGenes significant in both tests: {len(significant)}")
print(significant[["gene_id", "gene_score_%", "p_value_binomial", "p_value_poisson"]])

# Visualize outlier gene scores
fig, ax = plt.subplots(figsize=(12, 6))
outliers_sorted = outliers.sort_values("gene_score_%", ascending=False).head(30)
ax.barh(range(len(outliers_sorted)), outliers_sorted["gene_score_%"])
ax.set_yticks(range(len(outliers_sorted)))
ax.set_yticklabels(outliers_sorted["gene_id"], fontsize=9)
ax.set_xlabel("Gene Score (%)")
ax.set_title(f"Top 30 Outlier Genes in {mag_id}")
plt.tight_layout()
plt.savefig(f"{mag_id}_outliers.png", dpi=150)
plt.show()

Merge Taxonomic Information with Scores

# Load MAG scores with taxonomic metadata
scores = pd.read_csv(
    "output/longitudinal/scores/processed/combined/MAG/"
    "scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv",
    sep="\t"
)

# Group by phylum and calculate aggregated statistics
phylum_stats = scores.groupby("phylum").agg({
    "score_two_sample_paired_tTest (%)": ["mean", "max", "count"],
    "total_sites_per_group_two_sample_paired_tTest": "mean"
}).round(2)
phylum_stats.columns = ["mean_score", "max_score", "num_mags", "avg_sites"]
phylum_stats = phylum_stats.sort_values("mean_score", ascending=False)

print("Parallelism Scores by Phylum:")
print(phylum_stats)

# Similarly for family level
family_stats = scores.groupby("family").agg({
    "score_two_sample_paired_tTest (%)": ["mean", "max", "count"]
}).round(2)
family_stats.columns = ["mean_score", "max_score", "num_mags"]
family_stats = family_stats[family_stats["num_mags"] >= 3]  # Filter families with 3+ MAGs
family_stats = family_stats.sort_values("mean_score", ascending=False)

print("\nTop Families by Mean Parallelism Score:")
print(family_stats.head(10))

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Phylum-level scores
phylum_stats.head(10)["mean_score"].plot(kind="barh", ax=ax1, color="steelblue")
ax1.set_xlabel("Mean Parallelism Score (%)")
ax1.set_title("Top 10 Phyla by Mean Score")

# Family-level scores
family_stats.head(10)["mean_score"].plot(kind="barh", ax=ax2, color="coral")
ax2.set_xlabel("Mean Parallelism Score (%)")
ax2.set_title("Top 10 Families by Mean Score")

plt.tight_layout()
plt.savefig("taxonomic_scores.png", dpi=150)
plt.show()

Loading Results in R

Load MAG Scores with tidyverse

library(tidyverse)
library(ggplot2)

# Load MAG-level scores
scores <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv")

# Display top MAGs by score
scores %>%
  arrange(desc(`score_two_sample_paired_tTest (%)`)) %>%
  select(MAG_ID, `score_two_sample_paired_tTest (%)`, total_sites_per_group_two_sample_paired_tTest, phylum, family) %>%
  head(10)

Arrange by Score and Select Columns

# More focused view with key columns
scores %>%
  select(MAG_ID, `score_two_sample_paired_tTest (%)`, phylum, family, class) %>%
  arrange(desc(`score_two_sample_paired_tTest (%)`)) %>%
  filter(`score_two_sample_paired_tTest (%)` > 50) %>%
  head(20)

# Save top MAGs to CSV
top_mags <- scores %>%
  arrange(desc(`score_two_sample_paired_tTest (%)`)) %>%
  slice_head(n = 20)
write_csv(top_mags, "top_20_mags.csv")

Group by Taxonomic Family and Summarize

# Aggregate statistics by family
family_summary <- scores %>%
  group_by(family) %>%
  summarise(
    mean_score = mean(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    max_score = max(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    median_score = median(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    n_mags = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_score))

print(family_summary)

# Visualize top families
family_summary %>%
  filter(n_mags >= 3) %>%  # Filter families with 3+ MAGs
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(family, mean_score), y = mean_score)) +
  geom_col(fill = "steelblue", color = "black") +
  geom_text(aes(label = paste0("n=", n_mags)), hjust = -0.1, size = 3) +
  coord_flip() +
  labs(x = "Family", y = "Mean Parallelism Score (%)", 
       title = "Top Families by Mean Parallelism Score") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

ggsave("family_scores.png", width = 10, height = 6, dpi = 150)

Aggregate by Phylum and Compare

# Phylum-level aggregation
phylum_summary <- scores %>%
  group_by(phylum) %>%
  summarise(
    mean_score = mean(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    sd_score = sd(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    max_score = max(`score_two_sample_paired_tTest (%)`, na.rm = TRUE),
    n_mags = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_score))

print(phylum_summary)

# Boxplot by phylum
scores %>%
  ggplot(aes(x = reorder(phylum, `score_two_sample_paired_tTest (%)`, FUN = median), 
             y = `score_two_sample_paired_tTest (%)`)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(width = 0.2, alpha = 0.3, size = 2) +
  coord_flip() +
  labs(x = "Phylum", y = "Parallelism Score (%)",
       title = "Score Distribution by Phylum") +
  theme_minimal()

ggsave("phylum_boxplot.png", width = 10, height = 6, dpi = 150)

Join Multiple Test Results

# Load multiple test results
paired <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv") %>%
  select(MAG_ID, score_paired = `score_two_sample_paired_tTest (%)`)

lmm <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_lmm-pre_post-treatment_control-MAGs.tsv") %>%
  select(MAG_ID, score_lmm = `score_lmm (%)`)

cmh <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_cmh-pre_post-treatment_control-MAGs.tsv") %>%
  select(MAG_ID, score_cmh = `score_CMH (%)`)

# Join all results
merged_scores <- paired %>%
  full_join(lmm, by = "MAG_ID") %>%
  full_join(cmh, by = "MAG_ID") %>%
  replace_na(list(score_paired = 0, score_lmm = 0, score_cmh = 0))

# Identify robust MAGs (high scores in multiple tests)
robust_mags <- merged_scores %>%
  mutate(num_high_scores = rowSums(across(starts_with("score_"), ~. > 50))) %>%
  filter(num_high_scores >= 2) %>%
  arrange(desc(num_high_scores), desc(score_paired))

print(robust_mags)

# Correlation between tests
merged_scores %>%
  select(starts_with("score_")) %>%
  cor()

# Visualize correlation
library(corrplot)
scores_matrix <- merged_scores %>% select(starts_with("score_")) %>% as.matrix()
corrplot(cor(scores_matrix), type = "lower", diag = FALSE)
ggsave("test_correlation.png", width = 8, height = 8, dpi = 150)

Load and Analyze Gene-Level Data

# Load gene scores for a specific MAG
mag_id <- "Bacteroides_001"
gene_scores <- read_tsv(
  glue::glue("output/longitudinal/scores/processed/gene_scores/{mag_id}_two_sample_paired_gene_scores_individual.tsv")
)

# Top genes by score
gene_scores %>%
  arrange(desc(score_%)) %>%
  select(gene_id, score_%, total_sites, significant_sites) %>%
  head(20)

# Load outlier genes
outliers <- read_tsv(
  glue::glue("output/longitudinal/outliers/pre_post-treatment_control/{mag_id}_two_sample_paired_outlier_genes.tsv")
)

# Filter for highly significant outliers
outliers_sig <- outliers %>%
  filter(p_value_binomial < 0.05, p_value_poisson < 0.05) %>%
  arrange(desc(gene_score_%))

print(outliers_sig)

# Visualization
outliers %>%
  arrange(desc(gene_score_%)) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(gene_id, gene_score_%), y = gene_score_%)) +
  geom_col(fill = "coral", color = "black") +
  geom_point(aes(color = p_value_binomial < 0.05), size = 3, alpha = 0.7) +
  coord_flip() +
  labs(x = "Gene ID", y = "Gene Score (%)", 
       title = glue::glue("Top 25 Outlier Genes in {mag_id}"),
       color = "Binomial p < 0.05") +
  theme_minimal()

ggsave(glue::glue("{mag_id}_outliers.png"), width = 12, height = 8, dpi = 150)

Biological Validation

After identifying candidate MAGs and genes, validate findings with these approaches:

Functional Annotation

  • KEGG Annotation - Assign enzymes (EC numbers) and metabolic pathways to outlier genes. Look for enrichment in pathways relevant to your experimental conditions:

    • Antibiotic exposure → Check for antibiotic resistance genes (aminoglycoside modifying enzymes, β-lactamases, efflux pumps)

    • Nutrient limitation → Look for transport proteins, biosynthetic pathways

    • High temperature → Molecular chaperones, thermostability proteins

  • COG (Cluster of Orthologous Groups) - Functional classification by COG family. Count significant genes per COG category (e.g., “Energy production” vs. “Defense mechanisms”) and test for over-representation using Fisher’s exact test.

  • Pfam Domains - Protein family annotation identifies functional domains within outlier genes. Multiple domains may indicate multifunctional or regulatory proteins.

  • Tools for annotation:

    # Using eggNOG-mapper (recommended for MAGs)
    emapper.py -i outlier_genes.fasta --output annotation_results -m diamond --data_dir /path/to/eggnog_data
    
    # Using InterProScan for detailed domain analysis
    interproscan.sh -i outlier_genes.fasta -f tsv -o annotation_results.tsv
    

Cross-referencing with Literature

  • Curated Databases:

    • CARD (Comprehensive Antibiotic Resistance Database) - If studying antibiotic resistance

    • VFDB (Virulence Factor Database) - For pathogenicity studies

    • BioCyc - Metabolic pathways and reactions

    • TCDB - Membrane transport proteins

  • Literature Mining:

    • Search PubMed/Google Scholar for MAG taxa (genus/family) + experimental condition (e.g., “Bacteroides antibiotic resistance”)

    • Compare high-scoring genes with previously published selection signatures in same organisms

    • Check if same functional categories repeatedly appear in independent studies

  • Example validation script:

    import pandas as pd
    
    # Load outliers and CARD annotations
    outliers = pd.read_csv("outlier_genes.tsv", sep="\t")
    card = pd.read_csv("card_annotations.tsv", sep="\t")
    
    # Find overlap
    resistance_genes = outliers.merge(card, on="gene_id", how="inner")
    print(f"Outlier genes matching CARD database: {len(resistance_genes)}")
    print(resistance_genes[["gene_id", "ARO", "phenotype"]])
    

Checking Genomic Context

  • Operon Structure - Genes functioning in the same pathway often cluster in operons. Check if multiple outlier genes are co-localized:

    # Extract genomic regions around outlier genes
    samtools faidx reference.fa contig:start-end > region.fa
    # Use BLAST or alignment tools to compare regions across samples
    
  • Mobile Genetic Elements - Outlier genes near transposases, integrases, or IS elements suggest recent horizontal gene transfer and potential adaptive introgression:

    # Check proximity to identified mobile elements
    outliers_nearby_mobile = outliers[
        outliers["distance_to_nearest_mobile_element"] < 5000  # 5 kb window
    ]
    print(f"Outliers near mobile elements: {len(outliers_nearby_mobile)}")
    
  • Genomic Islands - Large segments with unusual GC content or codon usage may indicate acquired genes. Tools like SIGI-HMM or IslandViewer can identify these regions.

  • Conservation Analysis - Highly conserved genes under selection are more likely functionally important:

    # Calculate sequence identity with close relatives
    # High identity + positive selection → functionally constrained
    # Low identity + positive selection → recent adaptation
    

Validating with dN/dS Analysis

  • Positive Selection Signature - dN/dS > 1 indicates positive (diversifying) selection; dN/dS < 1 indicates purifying selection:

    # AlleleFlux includes dN/dS calculation from timepoint data
    alleleflux-dnds --timepoint_data step2_results/ --output_dir dnds_results/ --cpus 16
    
  • Interpreting Results:

    • High score + dN/dS > 1 → Gene evolving adaptively under strong selection

    • High score + dN/dS < 1 → Adaptive structural changes without amino acid replacement (e.g., regulatory mutations)

    • High score + dN/dS ≈ 1 → Neutral evolution or recent selective sweep

  • Combined Analysis:

    import pandas as pd
    
    # Load outliers and dN/dS results
    outliers = pd.read_csv("outlier_genes.tsv", sep="\t")
    dnds = pd.read_csv("dnds_results.tsv", sep="\t")
    
    # Merge and filter
    combined = outliers.merge(dnds, on="gene_id", how="inner")
    positive_selection = combined[combined["dnds_ratio"] > 1]
    
    print(f"Genes with adaptive evolution signature: {len(positive_selection)}")
    print(positive_selection[["gene_id", "gene_score_%", "dnds_ratio"]])
    

Experimental Validation

  • Allele Frequency Confirmation - Use targeted sequencing or qPCR to independently validate predicted allele frequency changes:

    # Design primers flanking polymorphic sites
    # qPCR quantify allele-specific transcripts or genomic DNA
    # Expected: High frequency allele in enriched samples, low in control
    
  • Phenotypic Testing - Culture representative strains carrying different alleles and measure phenotype under experimental conditions:

    • Antibiotic resistance: Growth inhibition assays with/without antibiotics

    • Nutrient metabolism: Growth rates on different substrates

    • Stress tolerance: Growth at extreme pH, temperature, or osmolarity

  • Longitudinal Tracking - For time-series data, verify that predicted allele trajectories match experimental observations:

    # Extract predicted vs. observed allele frequencies
    predicted = pd.read_csv("allele_predictions.tsv", sep="\t")
    observed = pd.read_csv("observed_frequencies.tsv", sep="\t")
    
    # Calculate R² between predicted and observed
    correlation = predicted.merge(observed, on=["sample", "gene", "position"]).corr()
    print(f"Prediction accuracy (R²): {correlation.iloc[0,1]**2:.3f}")
    
  • Heterologous Expression - Express outlier genes in model organisms (e.g., E. coli) to assess function:

    • Clone gene into expression vector

    • Measure phenotypic effect (growth, metabolite production, resistance)

    • Compare wildtype vs. allelic variants if polymorphisms encode different proteins

Analysis Workflow

Step 1: Check Eligibility

cat eligibility_table_pre_post-treatment_control.tsv

Identify MAGs with sufficient coverage for statistical tests.

Step 2: Examine Scores

# MAG-level scores
head scores_two_sample_unpaired-pre_post-treatment_control-MAGs.tsv

# Taxonomic aggregation (family level)
head scores_two_sample_unpaired-pre_post-treatment_control-family.tsv

Focus on MAGs/taxa with high parallelism or divergence scores.

Step 3: Investigate Genes

# Gene scores for a high-scoring MAG
head MAG123_two_sample_unpaired_gene_scores_individual.tsv

# Outlier genes
head MAG123_two_sample_unpaired_outlier_genes.tsv

Identify candidate genes under strong selection.

Step 4: Compare Tests

Check consistency across statistical approaches (two-sample, LMM, CMH). Genes significant in multiple tests are most robust.

Step 5: Functional Analysis

  • Annotate outlier genes (KEGG, COG, Pfam)

  • Check biological relevance to experimental conditions

  • Consider genomic context (operons, mobile elements)

Troubleshooting

No results / empty files - Check eligibility table: MAGs may not meet min_sample_num or breadth_threshold - Verify input file paths in configuration - Check log files in logs/ directory

Low scores across all MAGs - Insufficient selective pressure or inappropriate timepoints - Try lowering p_value_threshold (e.g., 0.1 instead of 0.05) - Check if experimental conditions are strong enough

Inconsistent results between tests - LMM is sensitive to experimental design complexity - Two-sample tests affected by unbalanced groups - CMH best for detecting consistent directional changes - Use multiple tests for robust conclusions

Missing gene IDs - Ensure Prodigal predictions match reference FASTA contig names - Verify prodigal_path in configuration - Check gene FASTA headers match contig naming

For visualization of results, see Visualization Guide.