# Interpreting Results

This guide explains how to interpret the results produced by AlleleFlux.

## Output Structure

AlleleFlux organizes results by analysis type:

```text
output/
├── profiles/                  # Per-sample allele counts
├── metadata/                  # Per-MAG sample metadata
├── QC/                        # Quality control results
├── eligibility_table_*.tsv    # MAG eligibility for tests
├── allele_analysis/           # Allele frequency analysis
├── significance_tests/        # Statistical test results
│   ├── two_sample_unpaired/
│   ├── two_sample_paired/
│   ├── single_sample/
│   ├── lmm/
│   └── cmh/
├── scores/
│   ├── intermediate/          # Per-MAG scores
│   └── processed/
│       ├── combined/          # Aggregated MAG/taxa scores
│       └── gene_scores/       # Gene-level scores
└── outliers/                  # High-scoring outlier genes
```

## Key Files

**1. Eligibility Table**

`eligibility_table_{timepoints}-{groups}.tsv` - Which MAGs qualify for each test, based on coverage and sample counts

**2. Statistical Tests** (in `significance_tests/`)

Per-MAG files with p-values and test statistics:

- `{mag}_two_sample_unpaired.tsv.gz` - Unpaired group comparisons
- `{mag}_lmm.tsv.gz` - Linear mixed models
- `{mag}_cmh.tsv.gz` - Cochran-Mantel-Haenszel tests

Key columns: `contig`, `position`, `gene_id`, `p_value_{test}`, `q_value_{test}`

**3. Scores** (in `scores/processed/combined/`)

- `scores_{test}-{tp}-{gr}-MAGs.tsv` - MAG-level parallelism/divergence scores
- `scores_{test}-{tp}-{gr}-{taxon}.tsv` - Taxonomic aggregations (phylum to species)

**4. Gene Scores** (in `scores/processed/gene_scores/`)

- `{mag}_{test}_gene_scores_individual.tsv` - Per-gene scores
- `{mag}_{test}_outlier_genes.tsv` - High-scoring genes that may be under selection (located in `outliers/`, per the directory tree above)

## Score Interpretation

**Parallelism Score** (0-100%)

Measures consistent allele changes across replicates within a group. High scores → deterministic evolution (not random drift).

**Divergence Score** (0-100%)

Quantifies allele frequency differences between groups. High scores → differential selection between conditions.

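
To make the idea of a percentage score concrete, here is a minimal sketch. It assumes a score is the share of a gene's tested sites whose p-value falls below a threshold, which is what the `total_sites`, `significant_sites`, and `score_%` columns in the gene-score files suggest. AlleleFlux's actual scoring may differ. The function name `percent_significant`, the `p_value_cmh` column, and the example rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; not real AlleleFlux output.
EXAMPLE_TSV = (
    "contig\tposition\tgene_id\tp_value_cmh\n"
    "c1\t10\tgeneA\t0.001\n"
    "c1\t25\tgeneA\t0.20\n"
    "c1\t90\tgeneB\t0.03\n"
    "c1\t120\tgeneB\t0.04\n"
)


def percent_significant(handle, p_column, threshold=0.05):
    """Return {gene_id: percent of that gene's sites with p < threshold}.

    Assumption: a score is simply 100 * significant_sites / total_sites.
    """
    total = defaultdict(int)
    significant = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        gene = row["gene_id"]
        if not gene:
            continue  # skip sites that carry no gene ID
        total[gene] += 1
        if float(row[p_column]) < threshold:
            significant[gene] += 1
    return {gene: 100.0 * significant[gene] / total[gene] for gene in total}


scores = percent_significant(io.StringIO(EXAMPLE_TSV), "p_value_cmh")
print(scores)
```

Under this reading, a gene with a high percentage has many sites that pass the threshold, so the threshold you choose directly changes the score.
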
**CMH Test**

Detects parallel allele changes across timepoints while controlling for individual variation. Particularly powerful for longitudinal studies.

## File Format Details

**Profile files** (`profiles/{sample}_{mag}_profiled.tsv.gz`):

`contig`, `position`, `ref_base`, `total_coverage`, `A`, `C`, `G`, `T`, `gene_id`

**Statistical test results** (`significance_tests/{test}/{mag}_{test}.tsv.gz`):

`contig`, `position`, `gene_id`, `p_value_{test}`, `q_value_{test}`

**Gene scores** (`scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv`):

`gene_id`, `total_sites`, `significant_sites`, `score_%`

**Outliers** (`outliers/{mag}_{test}_outlier_genes.tsv`):

`gene_id`, `gene_score_%`, `mag_score_%`, `p_value_binomial`, `p_value_poisson`

## Analysis Workflow

**Step 1: Check Eligibility**

```bash
cat eligibility_table_pre_post-treatment_control.tsv
```

Identify MAGs with sufficient coverage for statistical tests.

**Step 2: Examine Scores**

```bash
# MAG-level scores
head scores_two_sample_unpaired-pre_post-treatment_control-MAGs.tsv

# Taxonomic aggregation (family level)
head scores_two_sample_unpaired-pre_post-treatment_control-family.tsv
```

Focus on MAGs/taxa with high parallelism or divergence scores.

**Step 3: Investigate Genes**

```bash
# Gene scores for a high-scoring MAG
head MAG123_two_sample_unpaired_gene_scores_individual.tsv

# Outlier genes
head MAG123_two_sample_unpaired_outlier_genes.tsv
```

Identify candidate genes under strong selection.

**Step 4: Compare Tests**

Check consistency across statistical approaches (two-sample, LMM, CMH). Genes significant in multiple tests are most robust.

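
The cross-test comparison in Step 4 can be sketched as a set intersection over per-test results. This is only an illustration. It assumes each test file has a `gene_id` column and a `q_value_{test}` column where `{test}` is the test's directory name. The helper names `significant_genes` and `genes_supported_by`, the 0.05 cutoff, and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; real result files are gzip-compressed TSVs.
EXAMPLE_RESULTS = {
    "two_sample_unpaired": (
        "contig\tposition\tgene_id\tq_value_two_sample_unpaired\n"
        "c1\t10\tgeneA\t0.01\n"
        "c1\t90\tgeneB\t0.30\n"
    ),
    "lmm": (
        "contig\tposition\tgene_id\tq_value_lmm\n"
        "c1\t10\tgeneA\t0.04\n"
        "c1\t200\tgeneC\t0.02\n"
    ),
    "cmh": (
        "contig\tposition\tgene_id\tq_value_cmh\n"
        "c1\t10\tgeneA\t0.20\n"
        "c1\t200\tgeneC\t0.01\n"
    ),
}


def significant_genes(handle, q_column, threshold=0.05):
    """Collect gene IDs with at least one site whose q-value is below threshold."""
    genes = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["gene_id"] and float(row[q_column]) < threshold:
            genes.add(row["gene_id"])
    return genes


def genes_supported_by(per_test_genes, min_tests=2):
    """Return sorted gene IDs that are significant in at least min_tests tests."""
    counts = Counter()
    for genes in per_test_genes.values():
        counts.update(genes)
    return sorted(gene for gene, n in counts.items() if n >= min_tests)


per_test = {
    test: significant_genes(io.StringIO(text), f"q_value_{test}")
    for test, text in EXAMPLE_RESULTS.items()
}
robust = genes_supported_by(per_test, min_tests=2)
print(robust)
```

For real output, replace `io.StringIO(text)` with `gzip.open(path, "rt", newline="")` from the standard library. Raising `min_tests` trades sensitivity for stricter agreement between tests.
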
**Step 5: Functional Analysis**

- Annotate outlier genes (KEGG, COG, Pfam)
- Check biological relevance to experimental conditions
- Consider genomic context (operons, mobile elements)

## Troubleshooting

**No results / empty files**

- Check the eligibility table: MAGs may not meet `min_sample_num` or `breadth_threshold`
- Verify input file paths in the configuration
- Check log files in the `logs/` directory

**Low scores across all MAGs**

- Selective pressure may be insufficient, or the timepoints may be inappropriate
- Try relaxing `p_value_threshold` (e.g., 0.1 instead of 0.05)
- Check whether the experimental conditions are strong enough

**Inconsistent results between tests**

- LMM is sensitive to experimental design complexity
- Two-sample tests are affected by unbalanced groups
- CMH is best for detecting consistent directional changes
- Use multiple tests for robust conclusions

**Missing gene IDs**

- Ensure Prodigal predictions match reference FASTA contig names
- Verify `prodigal_path` in the configuration
- Check that gene FASTA headers match contig naming

For visualization of results, see [Visualization Guide](visualization_guide.md).
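
For the "Missing gene IDs" case in Troubleshooting, the sketch below is one way to check that gene FASTA headers map back to reference contig names. It relies on Prodigal's usual convention of naming each gene as the contig ID followed by `_<index>`. The helper names `fasta_ids` and `genes_without_contig` and the example headers are hypothetical. If your gene IDs follow another scheme, adjust the suffix pattern.

```python
import re

# Made-up FASTA content for illustration only.
REFERENCE_FASTA = """>k141_1 flag=1
ACGTACGT
>k141_2 flag=0
GGCCTTAA
"""
GENE_FASTA = """>k141_1_1 # 3 # 200 # 1 # ID=1_1
ATGAAATAA
>k141_3_1 # 5 # 150 # -1 # ID=3_1
ATGCCCTAA
"""


def fasta_ids(text):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def genes_without_contig(gene_ids, contig_ids):
    """List gene IDs whose contig prefix is absent from the reference.

    Assumption: gene ID = contig ID + "_" + gene index (Prodigal style).
    """
    contigs = set(contig_ids)
    missing = []
    for gene_id in gene_ids:
        contig = re.sub(r"_\d+$", "", gene_id)  # drop the trailing gene index
        if contig not in contigs:
            missing.append(gene_id)
    return missing


missing = genes_without_contig(fasta_ids(GENE_FASTA), fasta_ids(REFERENCE_FASTA))
print(missing)
```

If the list is not empty, the gene predictions were probably made against a different or renamed set of contigs.
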