Interpreting Results¶

This guide explains how to interpret the results produced by AlleleFlux.

Output Structure¶

AlleleFlux organizes results by analysis type:

output/
├── profiles/               # Per-sample allele counts
├── metadata/               # Per-MAG sample metadata
├── QC/                     # Quality control results
├── eligibility_table_*.tsv # MAG eligibility for tests
├── allele_analysis/        # Allele frequency analysis
├── significance_tests/     # Statistical test results
│   ├── two_sample_unpaired/
│   ├── two_sample_paired/
│   ├── single_sample/
│   ├── lmm/
│   └── cmh/
├── scores/
│   ├── intermediate/        # Per-MAG scores
│   └── processed/
│       ├── combined/        # Aggregated MAG/taxa scores
│       └── gene_scores/     # Gene-level scores
└── outliers/               # High-scoring outlier genes

Key Files¶

1. Eligibility Table

eligibility_table_{timepoints}-{groups}.tsv - Which MAGs qualify for each test based on coverage/samples

2. Statistical Tests (in significance_tests/)

Per-MAG files with p-values and test statistics: - {mag}_two_sample_unpaired.tsv.gz - Unpaired group comparisons - {mag}_lmm.tsv.gz - Linear mixed models - {mag}_cmh.tsv.gz - Cochran-Mantel-Haenszel tests

Key columns: contig, position, gene_id, p_value_{test}, q_value_{test}

3. Scores (in scores/processed/combined/)

scores_{test}-{tp}-{gr}-MAGs.tsv - MAG-level parallelism/divergence scores
scores_{test}-{tp}-{gr}-{taxon}.tsv - Taxonomic aggregations (phylum to species)

4. Gene Scores (in scores/processed/gene_scores/)

{mag}_{test}_gene_scores_individual.tsv - Per-gene scores
{mag}_{test}_outlier_genes.tsv - High-scoring genes under selection

Score Interpretation¶

Parallelism Score (0-100%) Measures consistent allele changes across replicates within a group. High scores → deterministic evolution (not random drift).

Divergence Score (0-100%) Quantifies allele frequency differences between groups. High scores → differential selection between conditions.

CMH Test Detects parallel allele changes across timepoints while controlling for individual variation. Particularly powerful for longitudinal studies.

File Format Details¶

Profile files (profiles/{sample}_{mag}_profiled.tsv.gz): contig, position, ref_base, total_coverage, A, C, G, T, gene_id

Statistical test results (significance_tests/{test}/{mag}_{test}.tsv.gz): contig, position, gene_id, p_value_{test}, q_value_{test}

Gene scores (scores/processed/gene_scores/{mag}_{test}_gene_scores_individual.tsv): gene_id, total_sites, significant_sites, score_%

Outliers (outliers/{mag}_{test}_outlier_genes.tsv): gene_id, gene_score_%, mag_score_%, p_value_binomial, p_value_poisson

Analysis Workflow¶

Step 1: Check Eligibility

cat eligibility_table_pre_post-treatment_control.tsv

Identify MAGs with sufficient coverage for statistical tests.

Step 2: Examine Scores

# MAG-level scores
head scores_two_sample_unpaired-pre_post-treatment_control-MAGs.tsv

# Taxonomic aggregation (family level)
head scores_two_sample_unpaired-pre_post-treatment_control-family.tsv

Focus on MAGs/taxa with high parallelism or divergence scores.

Step 3: Investigate Genes

# Gene scores for a high-scoring MAG
head MAG123_two_sample_unpaired_gene_scores_individual.tsv

# Outlier genes
head MAG123_two_sample_unpaired_outlier_genes.tsv

Identify candidate genes under strong selection.

Step 4: Compare Tests

Check consistency across statistical approaches (two-sample, LMM, CMH). Genes significant in multiple tests are most robust.

Step 5: Functional Analysis

Annotate outlier genes (KEGG, COG, Pfam)
Check biological relevance to experimental conditions
Consider genomic context (operons, mobile elements)

Troubleshooting¶

No results / empty files - Check eligibility table: MAGs may not meet min_sample_num or breadth_threshold - Verify input file paths in configuration - Check log files in logs/ directory

Low scores across all MAGs - Insufficient selective pressure or inappropriate timepoints - Try lowering p_value_threshold (e.g., 0.1 instead of 0.05) - Check if experimental conditions are strong enough

Inconsistent results between tests - LMM is sensitive to experimental design complexity - Two-sample tests affected by unbalanced groups - CMH best for detecting consistent directional changes - Use multiple tests for robust conclusions

Missing gene IDs - Ensure Prodigal predictions match reference FASTA contig names - Verify prodigal_path in configuration - Check gene FASTA headers match contig naming

For visualization of results, see Visualization Guide.