Output Files Reference
AlleleFlux generates structured outputs organized by analysis type and data type.
Output Directory Structure
{root_dir}/
└── {data_type}/                        # "single" or "longitudinal"
    ├── profiles/                       # Sample profiles (per MAG)
    ├── inputMetadata/                  # MAG-sample mappings
    ├── QC/                             # Quality control metrics
    ├── eligibility_table_*.tsv         # MAG test eligibility
    ├── allele_analysis/                # Allele frequencies
    ├── significance_tests/             # Statistical test results
    │   ├── two_sample_unpaired_*/
    │   ├── two_sample_paired_*/
    │   ├── single_sample_*/
    │   ├── lmm_*/
    │   └── cmh_*/
    ├── scores/                         # Parallelism & divergence scores
    │   ├── intermediate/MAG_scores_*/
    │   └── processed/
    │       ├── combined/               # MAG-level summaries
    │       └── gene_scores_*/          # Gene-level summaries
    └── outlier_genes/                  # Outlier gene detection
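The layout above can be walked programmatically. The following is a minimal sketch, not part of AlleleFlux itself: `find_test_dirs` is a hypothetical helper that uses the directory prefixes from the tree to collect whatever result directories exist under `significance_tests/`.

```python
from pathlib import Path

# Test-type prefixes as they appear in the directory tree.
TEST_PREFIXES = (
    "two_sample_unpaired",
    "two_sample_paired",
    "single_sample",
    "lmm",
    "cmh",
)


def find_test_dirs(root_dir, data_type):
    """Map each test prefix to the result directories found for it.

    Hypothetical helper: it only relies on the documented layout
    {root_dir}/{data_type}/significance_tests/{prefix}_*/.
    """
    if data_type not in ("single", "longitudinal"):
        raise ValueError(f"unexpected data_type: {data_type!r}")
    base = Path(root_dir) / data_type / "significance_tests"
    return {
        prefix: sorted(p for p in base.glob(f"{prefix}_*") if p.is_dir())
        for prefix in TEST_PREFIXES
    }
```

Tests that were not run simply map to an empty list, so callers can iterate over the result without checking for missing directories first.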
Core Output Files
Profile Files
Path: profiles/{sample}/{sample}_{mag}_profiled.tsv.gz
Base-level coverage and allele counts per sample-MAG pair.
| Column | Type | Description |
|---|---|---|
| | str | Contig identifier |
| | int | 0-based position |
| | str | Reference base (A/C/G/T/N) |
| | int | Total read depth |
| | int | Base counts |
| | str | Overlapping gene (null if intergenic) |
Quality Control Files
Path: QC/QC_{timepoints}-{groups}/{mag}_qc.tsv
Sample-level QC metrics.
| Column | Type | Description |
|---|---|---|
| | str | Sample identifier |
| | float | Fraction of genome covered (0-1) |
| | float | Average depth |
| | bool | Whether sample passed QC |
Eligibility Table
Path: eligibility_table_{timepoints}-{groups}.tsv
Determines which MAGs qualify for each statistical test.
| Column | Description |
|---|---|
| | MAG identifier |
| | Eligible for unpaired tests and LMM |
| | Eligible for paired tests and CMH |
| | Per-group single-sample eligibility |
Allele Frequency Files
Path: allele_analysis/allele_analysis_{timepoints}-{groups}/{mag}_allele_frequency_*.tsv.gz
Position-level allele frequencies across samples.
| Column | Type | Description |
|---|---|---|
| | str | Contig identifier |
| | int | 0-based position |
| | str | Reference base |
| | float | Allele frequency in sample (0-1) |
| | str | Most common non-reference allele |
| | int | Read depth |
| | str | Overlapping gene |
Statistical Test Results
See also
For detailed information about all statistical tests and score calculation formulas, see Statistical Tests Reference.
How Scores Are Calculated:
For most tests (two-sample, LMM, single-sample), the score represents the percentage of significant sites:
Score (%) = (Significant Sites / Total Sites) × 100
For CMH tests, the score uses differential significance between timepoints (see CMH section below).
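The percentage score for the standard tests can be sketched as follows. This is an illustration under stated assumptions rather than AlleleFlux's own code: the function name `percent_significant` is hypothetical, the default threshold of 0.05 follows the "typically p < 0.05" convention noted for these files, and excluding missing p-values from the total is an assumption.

```python
import math


def percent_significant(p_values, threshold=0.05):
    """Score (%) = (Significant Sites / Total Sites) * 100.

    Assumption: sites with a missing p-value (None or NaN) are left out
    of the total rather than counted as non-significant.
    """
    tested = [p for p in p_values if p is not None and not math.isnan(p)]
    if not tested:
        return float("nan")  # no usable sites, so the score is undefined
    significant = sum(1 for p in tested if p < threshold)
    return 100.0 * significant / len(tested)
```

For example, four sites with p-values 0.01, 0.2, 0.04, and 0.5 contain two significant sites at the 0.05 threshold, giving a score of 50%.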
Two-Sample Tests
Paths:
significance_tests/two_sample_unpaired_{timepoints}-{groups}/{mag}_*.tsv.gz
significance_tests/two_sample_paired_{timepoints}-{groups}/{mag}_*.tsv.gz
| Column | Type | Description |
|---|---|---|
| | str, int | Genomic location |
| | str | Overlapping gene |
| | float | T-test p-value |
| | float | Mann-Whitney U p-value |
| | float | Mean allele frequency difference |
| | float | Effect size |
Single-Sample Test
Path: significance_tests/single_sample_{timepoints}-{groups}/{mag}_*.tsv.gz
Tests deviation from reference within each group.
| Column | Description |
|---|---|
| | Mean allele frequency in group |
| | One-sample t-test p-value |
CMH Test
Path: significance_tests/cmh_{timepoints}-{groups}/{mag}_*.tsv.gz
Cochran-Mantel-Haenszel test stratified by replicate/timepoint.
| Column | Description |
|---|---|
| | CMH test p-value |
| | Timepoint identifier (for longitudinal data) |
| | |
| Stratum columns | Allele counts per stratum (replicate or timepoint) |
CMH Score Calculation (Differential Significance):
Unlike other tests, CMH scores measure sites that become significant only at the focus timepoint:
Identify common sites: Positions present in results from both timepoints
Find differential sites: Sites where p_value_CMH < threshold at the focus timepoint but not at the other timepoint
Calculate percentage: Score (%) = (Differential Sites / Common Sites) × 100
Mathematical formula:
Let S_focus = sites significant at focus timepoint
Let S_other = sites significant at other timepoint
Let S_common = sites in both timepoints
Differential = (S_focus - S_other) ∩ S_common
Score = |Differential| / |S_common| × 100
Interpretation:
High scores indicate strong timepoint-specific selection at the focus timepoint
The focus timepoint (typically the later/derived state) determines directionality
Example: If focus: post, the score measures sites that became significant from pre→post
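The differential-significance calculation described in this section maps directly onto set operations. The sketch below assumes each timepoint's results have already been loaded into a dictionary keyed by site, for example `(contig, position)`, with the CMH p-value as the value. The function name and the input shape are hypothetical, not AlleleFlux's API.

```python
def cmh_differential_score(focus_pvals, other_pvals, threshold=0.05):
    """Percentage of common sites significant only at the focus timepoint.

    focus_pvals / other_pvals: dicts mapping a site key, such as
    (contig, position), to its CMH p-value (assumed input shape).
    """
    # S_common: sites present in the results of both timepoints.
    common = focus_pvals.keys() & other_pvals.keys()
    if not common:
        return float("nan")  # no shared sites, so the score is undefined
    # S_focus and S_other, restricted to the common sites.
    s_focus = {site for site in common if focus_pvals[site] < threshold}
    s_other = {site for site in common if other_pvals[site] < threshold}
    # Differential = (S_focus - S_other) ∩ S_common
    differential = s_focus - s_other
    return 100.0 * len(differential) / len(common)
```

Restricting both significant sets to the common sites before taking the difference gives the same result as intersecting afterwards, and it keeps sites that were tested at only one timepoint out of both the numerator and the denominator.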
LMM Test
Path: significance_tests/lmm_{timepoints}-{groups}/{mag}_*.tsv.gz
Linear mixed-effects model for longitudinal data.
| Column | Description |
|---|---|
| | Fixed-effect p-value |
| | Estimated effect size |
Score Files
MAG-Level Scores
Path: scores/processed/combined/MAG/scores_{test_type}-{timepoints}-{groups}*.tsv
Evolutionary significance scores per MAG for each statistical test.
Standard Tests (two-sample, LMM, single-sample, etc.):
| Column | Description |
|---|---|
| | MAG identifier |
| | Total genomic positions analyzed |
| | Number of sites with p < threshold |
| | Percentage of significant sites: (Significant Sites / Total Sites) × 100 |
| Taxonomy columns | Domain, phylum, class, order, family, genus, species |
| | Grouping level (e.g., "MAG_ID") |
CMH Scores (special calculation):
| Column | Description |
|---|---|
| | MAG identifier |
| | Which timepoint was designated as focus (derived state) |
| | Number of common sites across both timepoints |
| | Differential significant sites (focus only) |
| | Percentage: (Differential Sites / Common Sites) × 100 |
| Taxonomy columns | Domain, phylum, class, order, family, genus, species |
| | Grouping level |
Understanding the scores:
Higher scores indicate more genomic positions showing evolutionary signatures
Standard tests: Direct measure of selection strength across the genome
CMH scores: Measure of timepoint-specific directional changes
Gene-Level Scores
Path: scores/processed/gene_scores_{timepoints}-{groups}/{test_type}_gene_scores.tsv.gz
Scores aggregated by gene.
| Column | Description |
|---|---|
| | Identifiers |
| | Mean parallelism across gene positions |
| | Mean divergence across gene positions |
| | Count of significant sites in gene |
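As an illustration of the per-gene aggregation, the count of significant sites in each gene could be derived from site-level results roughly as follows. The input shape (pairs of gene identifier and p-value), the function name, and the 0.05 default threshold are assumptions for this sketch. Intergenic sites, whose gene is null, are skipped.

```python
from collections import Counter


def significant_sites_per_gene(rows, threshold=0.05):
    """Count sites with p < threshold for each gene.

    rows: iterable of (gene, p_value) pairs (assumed input shape);
    gene is None for intergenic positions.
    """
    counts = Counter()
    for gene, p_value in rows:
        if gene is None or p_value is None:
            continue  # skip intergenic sites and missing p-values
        if p_value < threshold:
            counts[gene] += 1
    return dict(counts)
```

Genes with no significant sites are absent from the returned dictionary, so use `.get(gene, 0)` when a zero count is needed.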
Outlier Gene Files
Path: outlier_genes/{timepoints}-{groups}/{test_type}_outlier_genes.tsv.gz
Genes with exceptionally high scores (potential adaptive targets).
| Column | Description |
|---|---|
| | Identifiers |
| | Gene-level parallelism score |
| | |
| | Standard deviations from MAG mean |
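A value expressed as "standard deviations from MAG mean" is a z-score. The sketch below shows one way to compute it from the gene-level scores of a single MAG. It assumes the sample standard deviation and a hypothetical function name; the estimator and any outlier cutoff that AlleleFlux actually uses are not stated here.

```python
import statistics


def gene_z_scores(gene_scores):
    """Return {gene: (score - MAG mean) / MAG standard deviation}.

    gene_scores: dict mapping gene identifier to its score within one MAG.
    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    values = list(gene_scores.values())
    if len(values) < 2:
        # A standard deviation needs at least two genes.
        return {gene: float("nan") for gene in gene_scores}
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # All genes score identically, so none deviates from the mean.
        return {gene: 0.0 for gene in gene_scores}
    return {gene: (score - mean) / sd for gene, score in gene_scores.items()}
```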
dN/dS Analysis Outputs
Generated by alleleflux-dnds-from-timepoints (see dN/dS Analysis Guide).
Codon Events: {mag}_codon_events_ng86.tsv.gz – Path-averaged S/NS counts per codon
Gene Summary: {mag}_gene_summary_ng86.tsv.gz – dN/dS ratios per gene
MAG Summary: {mag}_mag_summary_ng86.tsv.gz – Overall MAG dN/dS
Global Summary: {mag}_global_summary_ng86.tsv – Aggregate statistics
Key columns:
dN_dS: dN/dS ratio (>1 = positive selection, <1 = purifying)
potential_S, potential_N: Expected synonymous/non-synonymous sites
observed_S, observed_N: Fractional observed counts (path-averaged)
k: Number of positions changed in codon (1, 2, or 3)
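To show how these columns relate, the sketch below computes a dN/dS ratio from the observed and potential counts in the style of Nei and Gojobori (1986), which the `ng86` suffix in the file names suggests. Whether AlleleFlux applies the Jukes-Cantor correction in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3); undefined for p >= 0.75."""
    if p >= 0.75:
        return float("nan")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(observed_N, potential_N, observed_S, potential_S):
    """Return dN/dS, or NaN when the ratio is undefined.

    pN = observed_N / potential_N and pS = observed_S / potential_S are the
    proportions of changed sites; each is corrected with Jukes-Cantor
    (assumed here, following the standard NG86 method).
    """
    if potential_N <= 0 or potential_S <= 0:
        return float("nan")
    dn = jukes_cantor(observed_N / potential_N)
    ds = jukes_cantor(observed_S / potential_S)
    if math.isnan(dn) or math.isnan(ds) or ds == 0:
        return float("nan")  # no synonymous change, or saturated sites
    return dn / ds
```

Under this sketch a gene with no observed synonymous changes has an undefined ratio, which is one reason NaN values can appear in a dN_dS column.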
File Format Notes
Most files are gzip-compressed TSV (.tsv.gz)
Position numbering is 0-based
Missing values: NaN or empty string
p-values: [0, 1] range; significant sites typically p < 0.05
Allele frequencies: [0, 1] range (proportion of reads)
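Given these conventions, a reader for any of the output tables only needs the Python standard library. The following is a minimal sketch: `read_tsv` is a hypothetical helper, and it returns every value as a string, with `NaN` and empty strings normalised to `None`, so that callers decide how to convert each column.

```python
import csv
import gzip


def read_tsv(path):
    """Read a plain or gzip-compressed TSV into a list of dicts.

    Values equal to "NaN" or the empty string become None, matching the
    missing-value convention noted above. Other values stay as strings.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            {key: (None if value in ("", "NaN") else value)
             for key, value in row.items()}
            for row in reader
        ]
```

For large profile files, iterating over the `csv.DictReader` directly instead of building a list would keep memory use flat.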
See also: Interpreting Results, CLI Reference