# Interpreting Results This guide explains how to interpret the results produced by AlleleFlux. ## Output Structure AlleleFlux organizes results by analysis type: ```text output/ ├── profiles/ # Per-sample allele counts ├── metadata/ # Per-MAG sample metadata ├── QC/ # Quality control results ├── eligibility_table_*.tsv # MAG eligibility for tests ├── allele_analysis/ # Allele frequency analysis ├── significance_tests/ # Statistical test results │ ├── two_sample_unpaired/ │ ├── two_sample_paired/ │ ├── single_sample/ │ ├── lmm/ │ └── cmh/ ├── scores/ │ ├── intermediate/ # Per-MAG scores │ └── processed/ │ ├── combined/ # Aggregated MAG/taxa scores │ └── gene_scores/ # Gene-level scores └── outliers/ # High-scoring outlier genes ``` ## Quick Interpretation Guide After running AlleleFlux, follow this decision tree to interpret your results: 1. **Check eligibility** → How many MAGs passed QC? 2. **Compare MAG scores** → Which MAGs show the strongest selection? 3. **Examine gene scores** → Which genes drive the MAG-level signal? 4. **Check outliers** → Are there genes under exceptionally strong selection? 5. **Validate with dN/dS** → Do outlier genes show positive selection (dN/dS > 1)? 6. **Cross-reference tests** → Are results consistent across statistical methods? ### Step-by-Step Checklist Here's a practical checklist to work through after analysis: - [ ] **Step 1: Check Eligibility** - Open `eligibility_table_*.tsv` and count how many MAGs passed QC filtering (have "true" in relevant test columns). If <10 MAGs eligible, lower QC thresholds or check coverage. - [ ] **Step 2: Compare MAG Scores** - Load scores from `scores/processed/combined/MAG/` and identify top 10-20 MAGs by parallelism/divergence score. These are your primary candidates. - [ ] **Step 3: Examine Gene Scores** - For top MAGs, inspect `scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv` to see which genes contribute most to the MAG signal. - [ ] **Step 4: Check Outliers** - Review `outliers/{mag}_{test}_outlier_genes.tsv` files. Genes with multiple significant p-values (binomial, Poisson) are robust candidates. - [ ] **Step 5: Validate with dN/dS** - If available, run `alleleflux-dnds --timepoint_data results/` on outlier genes. High scores with dN/dS > 1 indicate positive selection. - [ ] **Step 6: Cross-Reference Tests** - Compare results across statistical methods (two_sample_paired, LMM, CMH). Genes significant in 2+ methods are more robust. ## Key Files **1. Eligibility Table** `eligibility_table_{timepoints}-{groups}.tsv` - Which MAGs qualify for each test based on coverage/samples **2. Statistical Tests** (in `significance_tests/`) Per-MAG files with p-values and test statistics: \- `{mag}_two_sample_unpaired.tsv.gz` - Unpaired group comparisons \- `{mag}_lmm.tsv.gz` - Linear mixed models \- `{mag}_cmh.tsv.gz` - Cochran-Mantel-Haenszel tests Key columns: `contig`, `position`, `gene_id`, `p_value_{test}`, `q_value_{test}` **3. Scores** (in `scores/processed/combined/`) - `scores_{test}-{tp}-{gr}-MAGs.tsv` - MAG-level parallelism/divergence scores - `scores_{test}-{tp}-{gr}-{taxon}.tsv` - Taxonomic aggregations (phylum to species) **4. Gene Scores** (in `scores/processed/gene_scores/`) - `{mag}_{test}_gene_scores_individual.tsv` - Per-gene scores - `{mag}_{test}_outlier_genes.tsv` - High-scoring genes under selection ## Score Interpretation **Parallelism Score** (0-100%) Measures consistent allele changes across replicates within a group. High scores → deterministic evolution (not random drift). **Divergence Score** (0-100%) Quantifies allele frequency differences between groups. High scores → differential selection between conditions. **CMH Test** Detects parallel allele changes across timepoints while controlling for individual variation. Particularly powerful for longitudinal studies. ## File Format Details **Profile files** (`profiles/{sample}_{mag}_profiled.tsv.gz`): `contig`, `position`, `ref_base`, `total_coverage`, `A`, `C`, `G`, `T`, `gene_id` **Statistical test results** (`significance_tests/{test}/{mag}_{test}.tsv.gz`): `contig`, `position`, `gene_id`, `p_value_{test}`, `q_value_{test}` **Gene scores** (`scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv`): `gene_id`, `total_sites`, `significant_sites`, `score_%` **Outliers** (`outliers/{mag}_{test}_outlier_genes.tsv`): `gene_id`, `gene_score_%`, `mag_score_%`, `p_value_binomial`, `p_value_poisson` ## Loading Results in Python ### Load and Display Top 10 MAGs by Parallelism Score ```python import pandas as pd import numpy as np import matplotlib.pyplot as plt # Load MAG-level scores scores = pd.read_csv( "output/longitudinal/scores/processed/combined/MAG/" "scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv", sep="\t" ) # Top 10 MAGs by parallelism score top_mags = scores.nlargest(10, "score_two_sample_paired_tTest (%)") print("Top 10 MAGs by Parallelism Score:") print(top_mags[["MAG_ID", "score_two_sample_paired_tTest (%)", "total_sites_per_group_two_sample_paired_tTest", "phylum", "family"]]) ``` ### Plot Score Distribution ```python # Visualize score distribution across all MAGs fig, ax = plt.subplots(figsize=(10, 6)) scores["score_two_sample_paired_tTest (%)"].hist(bins=50, ax=ax, edgecolor="black") ax.set_xlabel("Parallelism Score (%)") ax.set_ylabel("Number of MAGs") ax.set_title("Distribution of Parallelism Scores Across MAGs") ax.axvline(scores["score_two_sample_paired_tTest (%)"].median(), color="red", linestyle="--", label="Median") plt.legend() plt.tight_layout() plt.savefig("score_distribution.png", dpi=150) plt.show() # Summary statistics print(f"Mean parallelism score: {scores['score_two_sample_paired_tTest (%)'].mean():.2f}%") print(f"Median parallelism score: {scores['score_two_sample_paired_tTest (%)'].median():.2f}%") print(f"95th percentile: {scores['score_two_sample_paired_tTest (%)'].quantile(0.95):.2f}%") ``` ### Compare Scores Across Different Tests ```python # Load scores from different statistical tests output_base = "output/longitudinal/scores/processed/combined/MAG/" paired = pd.read_csv( f"{output_base}scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv", sep="\t" ) lmm = pd.read_csv( f"{output_base}scores_lmm-pre_post-treatment_control-MAGs.tsv", sep="\t" ) cmh = pd.read_csv( f"{output_base}scores_cmh-pre_post-treatment_control-MAGs.tsv", sep="\t" ) # Merge on MAG_ID merged = paired[["MAG_ID", "score_two_sample_paired_tTest (%)"]].merge( lmm[["MAG_ID", "score_lmm (%)"]], on="MAG_ID", how="outer" ).merge( cmh[["MAG_ID", "score_CMH (%)"]], on="MAG_ID", how="outer" ) # Fill NaN for MAGs missing from certain tests merged.fillna(0, inplace=True) # Correlation between test results print("Correlation between statistical tests:") print(merged[["score_two_sample_paired_tTest (%)", "score_lmm (%)", "score_CMH (%)"]].corr()) # Visualize consistency fig, axes = plt.subplots(1, 3, figsize=(15, 4)) axes[0].scatter(merged["score_two_sample_paired_tTest (%)"], merged["score_lmm (%)"], alpha=0.6) axes[0].set_xlabel("Two-Sample Paired (%)"); axes[0].set_ylabel("LMM (%)") axes[1].scatter(merged["score_two_sample_paired_tTest (%)"], merged["score_CMH (%)"], alpha=0.6) axes[1].set_xlabel("Two-Sample Paired (%)"); axes[1].set_ylabel("CMH (%)") axes[2].scatter(merged["score_lmm (%)"], merged["score_CMH (%)"], alpha=0.6) axes[2].set_xlabel("LMM (%)"); axes[2].set_ylabel("CMH (%)") plt.tight_layout() plt.savefig("test_consistency.png", dpi=150) plt.show() # Identify robust MAGs (high scores in multiple tests) threshold = 50 # 50th percentile robust_mags = merged[ (merged["score_two_sample_paired_tTest (%)"] > threshold) | (merged["score_lmm (%)"] > threshold) | (merged["score_CMH (%)"] > threshold) ] print(f"Found {len(robust_mags)} MAGs with score > {threshold}% in at least one test") ``` ### Load and Investigate Outlier Genes for a Specific MAG ```python # Load outlier genes for a specific MAG import gzip mag_id = "Bacteroides_001" outlier_file = ( f"output/longitudinal/outliers/pre_post-treatment_control/" f"{mag_id}_two_sample_paired_outlier_genes.tsv.gz" ) # Handle gzipped files if outlier_file.endswith('.gz'): outliers = pd.read_csv(outlier_file, sep="\t", compression='gzip') else: outliers = pd.read_csv(outlier_file, sep="\t") print(f"Found {len(outliers)} outlier genes for {mag_id}") # Display top outliers sorted by gene score top_outliers = outliers.nlargest(20, "gene_score_%")[ ["gene_id", "gene_score_%", "mag_score_%", "p_value_binomial", "p_value_poisson"] ] print("\nTop 20 Outlier Genes:") print(top_outliers) # Filter by significance (both p-values < 0.05) significant = outliers[ (outliers["p_value_binomial"] < 0.05) & (outliers["p_value_poisson"] < 0.05) ] print(f"\nGenes significant in both tests: {len(significant)}") print(significant[["gene_id", "gene_score_%", "p_value_binomial", "p_value_poisson"]]) # Visualize outlier gene scores fig, ax = plt.subplots(figsize=(12, 6)) outliers_sorted = outliers.sort_values("gene_score_%", ascending=False).head(30) ax.barh(range(len(outliers_sorted)), outliers_sorted["gene_score_%"]) ax.set_yticks(range(len(outliers_sorted))) ax.set_yticklabels(outliers_sorted["gene_id"], fontsize=9) ax.set_xlabel("Gene Score (%)") ax.set_title(f"Top 30 Outlier Genes in {mag_id}") plt.tight_layout() plt.savefig(f"{mag_id}_outliers.png", dpi=150) plt.show() ``` ### Merge Taxonomic Information with Scores ```python # Load MAG scores with taxonomic metadata scores = pd.read_csv( "output/longitudinal/scores/processed/combined/MAG/" "scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv", sep="\t" ) # Group by phylum and calculate aggregated statistics phylum_stats = scores.groupby("phylum").agg({ "score_two_sample_paired_tTest (%)": ["mean", "max", "count"], "total_sites_per_group_two_sample_paired_tTest": "mean" }).round(2) phylum_stats.columns = ["mean_score", "max_score", "num_mags", "avg_sites"] phylum_stats = phylum_stats.sort_values("mean_score", ascending=False) print("Parallelism Scores by Phylum:") print(phylum_stats) # Similarly for family level family_stats = scores.groupby("family").agg({ "score_two_sample_paired_tTest (%)": ["mean", "max", "count"] }).round(2) family_stats.columns = ["mean_score", "max_score", "num_mags"] family_stats = family_stats[family_stats["num_mags"] >= 3] # Filter families with 3+ MAGs family_stats = family_stats.sort_values("mean_score", ascending=False) print("\nTop Families by Mean Parallelism Score:") print(family_stats.head(10)) # Visualization fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6)) # Phylum-level scores phylum_stats.head(10)["mean_score"].plot(kind="barh", ax=ax1, color="steelblue") ax1.set_xlabel("Mean Parallelism Score (%)") ax1.set_title("Top 10 Phyla by Mean Score") # Family-level scores family_stats.head(10)["mean_score"].plot(kind="barh", ax=ax2, color="coral") ax2.set_xlabel("Mean Parallelism Score (%)") ax2.set_title("Top 10 Families by Mean Score") plt.tight_layout() plt.savefig("taxonomic_scores.png", dpi=150) plt.show() ``` ## Loading Results in R ### Load MAG Scores with tidyverse ```r library(tidyverse) library(ggplot2) # Load MAG-level scores scores <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv") # Display top MAGs by score scores %>% arrange(desc(`score_two_sample_paired_tTest (%)`)) %>% select(MAG_ID, `score_two_sample_paired_tTest (%)`, total_sites_per_group_two_sample_paired_tTest, phylum, family) %>% head(10) ``` ### Arrange by Score and Select Columns ```r # More focused view with key columns scores %>% select(MAG_ID, `score_two_sample_paired_tTest (%)`, phylum, family, class) %>% arrange(desc(`score_two_sample_paired_tTest (%)`)) %>% filter(`score_two_sample_paired_tTest (%)` > 50) %>% head(20) # Save top MAGs to CSV top_mags <- scores %>% arrange(desc(`score_two_sample_paired_tTest (%)`)) %>% slice_head(n = 20) write_csv(top_mags, "top_20_mags.csv") ``` ### Group by Taxonomic Family and Summarize ```r # Aggregate statistics by family family_summary <- scores %>% group_by(family) %>% summarise( mean_score = mean(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), max_score = max(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), median_score = median(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), n_mags = n(), .groups = "drop" ) %>% arrange(desc(mean_score)) print(family_summary) # Visualize top families family_summary %>% filter(n_mags >= 3) %>% # Filter families with 3+ MAGs slice_head(n = 15) %>% ggplot(aes(x = reorder(family, mean_score), y = mean_score)) + geom_col(fill = "steelblue", color = "black") + geom_text(aes(label = paste0("n=", n_mags)), hjust = -0.1, size = 3) + coord_flip() + labs(x = "Family", y = "Mean Parallelism Score (%)", title = "Top Families by Mean Parallelism Score") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5, face = "bold")) ggsave("family_scores.png", width = 10, height = 6, dpi = 150) ``` ### Aggregate by Phylum and Compare ```r # Phylum-level aggregation phylum_summary <- scores %>% group_by(phylum) %>% summarise( mean_score = mean(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), sd_score = sd(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), max_score = max(`score_two_sample_paired_tTest (%)`, na.rm = TRUE), n_mags = n(), .groups = "drop" ) %>% arrange(desc(mean_score)) print(phylum_summary) # Boxplot by phylum scores %>% ggplot(aes(x = reorder(phylum, `score_two_sample_paired_tTest (%)`, FUN = median), y = `score_two_sample_paired_tTest (%)`)) + geom_boxplot(fill = "lightblue", color = "black") + geom_jitter(width = 0.2, alpha = 0.3, size = 2) + coord_flip() + labs(x = "Phylum", y = "Parallelism Score (%)", title = "Score Distribution by Phylum") + theme_minimal() ggsave("phylum_boxplot.png", width = 10, height = 6, dpi = 150) ``` ### Join Multiple Test Results ```r # Load multiple test results paired <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv") %>% select(MAG_ID, score_paired = `score_two_sample_paired_tTest (%)`) lmm <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_lmm-pre_post-treatment_control-MAGs.tsv") %>% select(MAG_ID, score_lmm = `score_lmm (%)`) cmh <- read_tsv("output/longitudinal/scores/processed/combined/MAG/scores_cmh-pre_post-treatment_control-MAGs.tsv") %>% select(MAG_ID, score_cmh = `score_CMH (%)`) # Join all results merged_scores <- paired %>% full_join(lmm, by = "MAG_ID") %>% full_join(cmh, by = "MAG_ID") %>% replace_na(list(score_paired = 0, score_lmm = 0, score_cmh = 0)) # Identify robust MAGs (high scores in multiple tests) robust_mags <- merged_scores %>% mutate(num_high_scores = rowSums(across(starts_with("score_"), ~. > 50))) %>% filter(num_high_scores >= 2) %>% arrange(desc(num_high_scores), desc(score_paired)) print(robust_mags) # Correlation between tests merged_scores %>% select(starts_with("score_")) %>% cor() # Visualize correlation library(corrplot) scores_matrix <- merged_scores %>% select(starts_with("score_")) %>% as.matrix() corrplot(cor(scores_matrix), type = "lower", diag = FALSE) ggsave("test_correlation.png", width = 8, height = 8, dpi = 150) ``` ### Load and Analyze Gene-Level Data ```r # Load gene scores for a specific MAG mag_id <- "Bacteroides_001" gene_scores <- read_tsv( glue::glue("output/longitudinal/scores/processed/gene_scores/{mag_id}_two_sample_paired_gene_scores_individual.tsv") ) # Top genes by score gene_scores %>% arrange(desc(score_%)) %>% select(gene_id, score_%, total_sites, significant_sites) %>% head(20) # Load outlier genes outliers <- read_tsv( glue::glue("output/longitudinal/outliers/pre_post-treatment_control/{mag_id}_two_sample_paired_outlier_genes.tsv") ) # Filter for highly significant outliers outliers_sig <- outliers %>% filter(p_value_binomial < 0.05, p_value_poisson < 0.05) %>% arrange(desc(gene_score_%)) print(outliers_sig) # Visualization outliers %>% arrange(desc(gene_score_%)) %>% slice_head(n = 25) %>% ggplot(aes(x = reorder(gene_id, gene_score_%), y = gene_score_%)) + geom_col(fill = "coral", color = "black") + geom_point(aes(color = p_value_binomial < 0.05), size = 3, alpha = 0.7) + coord_flip() + labs(x = "Gene ID", y = "Gene Score (%)", title = glue::glue("Top 25 Outlier Genes in {mag_id}"), color = "Binomial p < 0.05") + theme_minimal() ggsave(glue::glue("{mag_id}_outliers.png"), width = 12, height = 8, dpi = 150) ``` ## Biological Validation After identifying candidate MAGs and genes, validate findings with these approaches: ### Functional Annotation - **KEGG Annotation** - Assign enzymes (EC numbers) and metabolic pathways to outlier genes. Look for enrichment in pathways relevant to your experimental conditions: - Antibiotic exposure → Check for antibiotic resistance genes (aminoglycoside modifying enzymes, β-lactamases, efflux pumps) - Nutrient limitation → Look for transport proteins, biosynthetic pathways - High temperature → Molecular chaperones, thermostability proteins - **COG (Cluster of Orthologous Groups)** - Functional classification by COG family. Count significant genes per COG category (e.g., "Energy production" vs. "Defense mechanisms") and test for over-representation using Fisher's exact test. - **Pfam Domains** - Protein family annotation identifies functional domains within outlier genes. Multiple domains may indicate multifunctional or regulatory proteins. - **Tools for annotation**: ```bash # Using eggNOG-mapper (recommended for MAGs) emapper.py -i outlier_genes.fasta --output annotation_results -m diamond --data_dir /path/to/eggnog_data # Using InterProScan for detailed domain analysis interproscan.sh -i outlier_genes.fasta -f tsv -o annotation_results.tsv ``` ### Cross-referencing with Literature - **Curated Databases**: - **CARD (Comprehensive Antibiotic Resistance Database)** - If studying antibiotic resistance - **VFDB (Virulence Factor Database)** - For pathogenicity studies - **BioCyc** - Metabolic pathways and reactions - **TCDB** - Membrane transport proteins - **Literature Mining**: - Search PubMed/Google Scholar for MAG taxa (genus/family) + experimental condition (e.g., "Bacteroides antibiotic resistance") - Compare high-scoring genes with previously published selection signatures in same organisms - Check if same functional categories repeatedly appear in independent studies - **Example validation script**: ```python import pandas as pd # Load outliers and CARD annotations outliers = pd.read_csv("outlier_genes.tsv", sep="\t") card = pd.read_csv("card_annotations.tsv", sep="\t") # Find overlap resistance_genes = outliers.merge(card, on="gene_id", how="inner") print(f"Outlier genes matching CARD database: {len(resistance_genes)}") print(resistance_genes[["gene_id", "ARO", "phenotype"]]) ``` ### Checking Genomic Context - **Operon Structure** - Genes functioning in the same pathway often cluster in operons. Check if multiple outlier genes are co-localized: ```bash # Extract genomic regions around outlier genes samtools faidx reference.fa contig:start-end > region.fa # Use BLAST or alignment tools to compare regions across samples ``` - **Mobile Genetic Elements** - Outlier genes near transposases, integrases, or IS elements suggest recent horizontal gene transfer and potential adaptive introgression: ```python # Check proximity to identified mobile elements outliers_nearby_mobile = outliers[ outliers["distance_to_nearest_mobile_element"] < 5000 # 5 kb window ] print(f"Outliers near mobile elements: {len(outliers_nearby_mobile)}") ``` - **Genomic Islands** - Large segments with unusual GC content or codon usage may indicate acquired genes. Tools like SIGI-HMM or IslandViewer can identify these regions. - **Conservation Analysis** - Highly conserved genes under selection are more likely functionally important: ```python # Calculate sequence identity with close relatives # High identity + positive selection → functionally constrained # Low identity + positive selection → recent adaptation ``` ### Validating with dN/dS Analysis - **Positive Selection Signature** - dN/dS > 1 indicates positive (diversifying) selection; dN/dS < 1 indicates purifying selection: ```bash # AlleleFlux includes dN/dS calculation from timepoint data alleleflux-dnds --timepoint_data step2_results/ --output_dir dnds_results/ --cpus 16 ``` - **Interpreting Results**: - High score + dN/dS > 1 → Gene evolving adaptively under strong selection - High score + dN/dS < 1 → Adaptive structural changes without amino acid replacement (e.g., regulatory mutations) - High score + dN/dS ≈ 1 → Neutral evolution or recent selective sweep - **Combined Analysis**: ```python import pandas as pd # Load outliers and dN/dS results outliers = pd.read_csv("outlier_genes.tsv", sep="\t") dnds = pd.read_csv("dnds_results.tsv", sep="\t") # Merge and filter combined = outliers.merge(dnds, on="gene_id", how="inner") positive_selection = combined[combined["dnds_ratio"] > 1] print(f"Genes with adaptive evolution signature: {len(positive_selection)}") print(positive_selection[["gene_id", "gene_score_%", "dnds_ratio"]]) ``` ### Experimental Validation - **Allele Frequency Confirmation** - Use targeted sequencing or qPCR to independently validate predicted allele frequency changes: ```bash # Design primers flanking polymorphic sites # qPCR quantify allele-specific transcripts or genomic DNA # Expected: High frequency allele in enriched samples, low in control ``` - **Phenotypic Testing** - Culture representative strains carrying different alleles and measure phenotype under experimental conditions: - **Antibiotic resistance**: Growth inhibition assays with/without antibiotics - **Nutrient metabolism**: Growth rates on different substrates - **Stress tolerance**: Growth at extreme pH, temperature, or osmolarity - **Longitudinal Tracking** - For time-series data, verify that predicted allele trajectories match experimental observations: ```python # Extract predicted vs. observed allele frequencies predicted = pd.read_csv("allele_predictions.tsv", sep="\t") observed = pd.read_csv("observed_frequencies.tsv", sep="\t") # Calculate R² between predicted and observed correlation = predicted.merge(observed, on=["sample", "gene", "position"]).corr() print(f"Prediction accuracy (R²): {correlation.iloc[0,1]**2:.3f}") ``` - **Heterologous Expression** - Express outlier genes in model organisms (e.g., *E. coli*) to assess function: - Clone gene into expression vector - Measure phenotypic effect (growth, metabolite production, resistance) - Compare wildtype vs. allelic variants if polymorphisms encode different proteins ## Analysis Workflow **Step 1: Check Eligibility** ```bash cat eligibility_table_pre_post-treatment_control.tsv ``` Identify MAGs with sufficient coverage for statistical tests. **Step 2: Examine Scores** ```bash # MAG-level scores head scores_two_sample_unpaired-pre_post-treatment_control-MAGs.tsv # Taxonomic aggregation (family level) head scores_two_sample_unpaired-pre_post-treatment_control-family.tsv ``` Focus on MAGs/taxa with high parallelism or divergence scores. **Step 3: Investigate Genes** ```bash # Gene scores for a high-scoring MAG head MAG123_two_sample_unpaired_gene_scores_individual.tsv # Outlier genes head MAG123_two_sample_unpaired_outlier_genes.tsv ``` Identify candidate genes under strong selection. **Step 4: Compare Tests** Check consistency across statistical approaches (two-sample, LMM, CMH). Genes significant in multiple tests are most robust. **Step 5: Functional Analysis** - Annotate outlier genes (KEGG, COG, Pfam) - Check biological relevance to experimental conditions - Consider genomic context (operons, mobile elements) ## Troubleshooting **No results / empty files** \- Check eligibility table: MAGs may not meet `min_sample_num` or `breadth_threshold` \- Verify input file paths in configuration \- Check log files in `logs/` directory **Low scores across all MAGs** \- Insufficient selective pressure or inappropriate timepoints \- Try lowering `p_value_threshold` (e.g., 0.1 instead of 0.05) \- Check if experimental conditions are strong enough **Inconsistent results between tests** \- LMM is sensitive to experimental design complexity \- Two-sample tests affected by unbalanced groups \- CMH best for detecting consistent directional changes \- Use multiple tests for robust conclusions **Missing gene IDs** \- Ensure Prodigal predictions match reference FASTA contig names \- Verify `prodigal_path` in configuration \- Check gene FASTA headers match contig naming For visualization of results, see [Visualization Guide](visualization_guide.md).