Output Files Reference

AlleleFlux generates structured outputs organized by analysis type and data type.

Output Directory Structure

{root_dir}/
└── {data_type}/                    # "single" or "longitudinal"
    ├── profiles/                   # Sample profiles (per MAG)
    ├── inputMetadata/              # MAG-sample mappings
    ├── QC/                         # Quality control metrics
    ├── eligibility_table_*.tsv     # MAG test eligibility
    ├── allele_analysis/            # Allele frequencies
    ├── significance_tests/         # Statistical test results
    │   ├── two_sample_unpaired_*/
    │   ├── two_sample_paired_*/
    │   ├── single_sample_*/
    │   ├── lmm_*/
    │   └── cmh_*/
    ├── scores/                     # Parallelism & divergence scores
    │   ├── intermediate/MAG_scores_*/
    │   └── processed/
    │       ├── combined/           # MAG-level summaries
    │       └── gene_scores_*/      # Gene-level summaries
    └── outlier_genes/              # Outlier gene detection

Core Output Files

Profile Files

Path: profiles/{sample}/{sample}_{mag}_profiled.tsv.gz

Base-level coverage and allele counts per sample-MAG pair.

Column           Type   Description
---------------  -----  -------------------------------------
contig           str    Contig identifier
position         int    0-based position
ref_base         str    Reference base (A/C/G/T/N)
total_coverage   int    Total read depth
A, C, G, T       int    Base counts
gene_id          str    Overlapping gene (null if intergenic)
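
As a rough illustration of how these columns fit together, the sketch below parses profile rows with Python's standard library and derives per-base frequencies as count / total_coverage. The helper name `base_frequencies` and the inline sample rows are hypothetical, and the sketch assumes the count columns are literally named A, C, G, and T. Real profile files are gzip-compressed and would be opened with `gzip.open(path, "rt")` instead of the in-memory buffer used here.

```python
import csv
import io

# Hypothetical rows that mimic the documented columns; not real AlleleFlux output.
SAMPLE_PROFILE = (
    "contig\tposition\tref_base\ttotal_coverage\tA\tC\tG\tT\tgene_id\n"
    "contig_1\t0\tA\t10\t8\t2\t0\t0\tgene_1\n"
    "contig_1\t1\tC\t0\t0\t0\t0\t0\t\n"
)


def base_frequencies(handle):
    """Yield (contig, position, {base: frequency}) for each profile row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        depth = int(row["total_coverage"])
        counts = {base: int(row[base]) for base in "ACGT"}
        # Zero-depth positions get frequency 0.0 rather than a division error.
        freqs = {b: (n / depth if depth else 0.0) for b, n in counts.items()}
        yield row["contig"], int(row["position"]), freqs


rows = list(base_frequencies(io.StringIO(SAMPLE_PROFILE)))
for contig, position, freqs in rows:
    print(contig, position, freqs)
```

Dividing by total_coverage rather than by the sum of the four counts is a design choice of this sketch; the two can differ if reads with ambiguous bases contribute to depth.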

Quality Control Files

Path: QC/QC_{timepoints}-{groups}/{mag}_qc.tsv

Sample-level QC metrics.

Column                Type    Description
--------------------  ------  --------------------------------
sample_id             str     Sample identifier
breadth_of_coverage   float   Fraction of genome covered (0-1)
mean_coverage         float   Average depth
passed_breadth        bool    Whether sample passed QC

Eligibility Table

Path: eligibility_table_{timepoints}-{groups}.tsv

Determines which MAGs qualify for each statistical test.

Column                           Description
-------------------------------  -----------------------------------
mag_id                           MAG identifier
unpaired_test_eligible           Eligible for unpaired tests and LMM
paired_test_eligible             Eligible for paired tests and CMH
single_sample_eligible_{group}   Per-group single-sample eligibility
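
A minimal sketch of selecting the MAGs that qualify for a given test from this table. The helper `eligible_mags`, the sample rows, and the assumption that eligibility flags are serialized as the strings True/False are all hypothetical; check an actual eligibility table before relying on the string comparison.

```python
import csv
import io

# Hypothetical table contents; the True/False serialization is an assumption.
SAMPLE_ELIGIBILITY = (
    "mag_id\tunpaired_test_eligible\tpaired_test_eligible\n"
    "MAG_001\tTrue\tFalse\n"
    "MAG_002\tTrue\tTrue\n"
)


def eligible_mags(handle, column):
    """Return mag_id values whose flag in `column` reads as true."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["mag_id"]
        for row in reader
        if row[column].strip().lower() == "true"
    ]


paired = eligible_mags(io.StringIO(SAMPLE_ELIGIBILITY), "paired_test_eligible")
unpaired = eligible_mags(io.StringIO(SAMPLE_ELIGIBILITY), "unpaired_test_eligible")
print(paired, unpaired)
```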

Allele Frequency Files

Path: allele_analysis/allele_analysis_{timepoints}-{groups}/{mag}_allele_frequency_*.tsv.gz

Position-level allele frequencies across samples.

Column                  Type    Description
----------------------  ------  --------------------------------
contig                  str     Contig identifier
position                int     0-based position
ref_base                str     Reference base
{sample}_allele_freq    float   Allele frequency in sample (0-1)
{sample}_alt_allele     str     Most common non-reference allele
{sample}_coverage       int     Read depth
gene_id                 str     Overlapping gene

Statistical Test Results

See also: Statistical Tests Reference, for detailed information about all statistical tests and score calculation formulas.

How Scores Are Calculated:

For most tests (two-sample, LMM, single-sample), the score represents the percentage of significant sites:

Score (%) = (Significant Sites / Total Sites) × 100
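
The formula above can be sketched in a few lines of Python. The function name `significance_score`, the default threshold of 0.05, and the choice to exclude missing p-values from the total are assumptions made for illustration, not a description of AlleleFlux internals.

```python
def significance_score(p_values, threshold=0.05):
    """Return the percentage of tested sites with p-value below threshold."""
    # Assumption: sites with a missing p-value (None or NaN) are not counted.
    tested = [p for p in p_values if p is not None and p == p]
    if not tested:
        return 0.0
    significant = sum(1 for p in tested if p < threshold)
    return significant / len(tested) * 100


# Two of the four sites fall below 0.05.
score = significance_score([0.01, 0.20, 0.03, 0.50])
print(score)
```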

For CMH tests, the score uses differential significance between timepoints (see CMH section below).

Two-Sample Tests

Paths:

  • significance_tests/two_sample_unpaired_{timepoints}-{groups}/{mag}_*.tsv.gz

  • significance_tests/two_sample_paired_{timepoints}-{groups}/{mag}_*.tsv.gz

Column                 Type       Description
---------------------  ---------  --------------------------------
contig, position       str, int   Genomic location
gene_id                str        Overlapping gene
tTest_p_value          float      T-test p-value
mannwhitneyu_p_value   float      Mann-Whitney U p-value
mean_diff              float      Mean allele frequency difference
cohen_d                float      Effect size

Single-Sample Test

Path: significance_tests/single_sample_{timepoints}-{groups}/{mag}_*.tsv.gz

Tests deviation from reference within each group.

Column                    Description
------------------------  ------------------------------
avg_allele_freq_{group}   Mean allele frequency in group
tTest_p_value_{group}     One-sample t-test p-value

CMH Test

Path: significance_tests/cmh_{timepoints}-{groups}/{mag}_*.tsv.gz

Cochran-Mantel-Haenszel test stratified by replicate/timepoint.

Column            Description
----------------  --------------------------------------------------
p_value_CMH       CMH test p-value
time              Timepoint identifier (for longitudinal data)
mode              across-time or across-group
Stratum columns   Allele counts per stratum (replicate or timepoint)

CMH Score Calculation (Differential Significance):

Unlike other tests, CMH scores measure sites that become significant only at the focus timepoint:

  1. Identify common sites: Positions present in results from BOTH timepoints

  2. Find differential sites: Sites where p_value_CMH < threshold at focus timepoint BUT NOT at the other timepoint

  3. Calculate percentage: Score (%) = (Differential Sites / Common Sites) × 100

Mathematical formula:

Let S_focus = sites significant at focus timepoint
Let S_other = sites significant at other timepoint
Let S_common = sites in both timepoints

Differential = (S_focus - S_other) ∩ S_common
Score = |Differential| / |S_common| × 100
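
The set formula above maps directly onto Python set operations. In this sketch each timepoint's results are represented as a dictionary from (contig, position) to p_value_CMH; the function name `cmh_differential_score`, that data layout, and the 0.05 threshold are hypothetical choices for illustration.

```python
def cmh_differential_score(focus_p, other_p, threshold=0.05):
    """Percentage of common sites significant only at the focus timepoint.

    focus_p and other_p map (contig, position) -> p_value_CMH.
    """
    common = focus_p.keys() & other_p.keys()                      # S_common
    if not common:
        return 0.0
    sig_focus = {s for s, p in focus_p.items() if p < threshold}  # S_focus
    sig_other = {s for s, p in other_p.items() if p < threshold}  # S_other
    differential = (sig_focus - sig_other) & common
    return len(differential) / len(common) * 100


focus = {("c1", 0): 0.01, ("c1", 1): 0.001, ("c1", 2): 0.30, ("c1", 3): 0.02}
other = {("c1", 0): 0.02, ("c1", 1): 0.40, ("c1", 2): 0.50, ("c1", 4): 0.01}
score = cmh_differential_score(focus, other)
print(round(score, 2))
```

In this toy input three sites are common to both timepoints, and only ("c1", 1) is significant at the focus timepoint alone. Site ("c1", 3) is significant at the focus timepoint but is excluded because it has no result at the other timepoint.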

Interpretation:

  • High scores indicate strong timepoint-specific selection at the focus timepoint

  • The focus timepoint (typically the later/derived state) determines directionality

  • Example: If focus: post, the score measures sites that became significant from pre→post

LMM Test

Path: significance_tests/lmm_{timepoints}-{groups}/{mag}_*.tsv.gz

Linear mixed-effects model for longitudinal data.

Column        Description
------------  ---------------------
lmm_p_value   Fixed-effect p-value
coefficient   Estimated effect size

Score Files

MAG-Level Scores

Path: scores/processed/combined/MAG/scores_{test_type}-{timepoints}-{groups}*.tsv

Evolutionary significance scores per MAG for each statistical test.

Standard Tests (two-sample, LMM, single-sample, etc.):

Column                               Description
-----------------------------------  ----------------------------------------------------------
MAG_ID                               MAG identifier
total_sites_per_group_{test}         Total genomic positions analyzed
significant_sites_per_group_{test}   Number of sites with p < threshold
score_{test} (%)                     Percentage of significant sites: (significant/total) × 100
Taxonomy columns                     Domain, phylum, class, order, family, genus, species
grouped_by                           Grouping level (e.g., "MAG_ID")

CMH Scores (special calculation):

Column                            Description
--------------------------------  -------------------------------------------------------
MAG_ID                            MAG identifier
focus_timepoint                   Which timepoint was designated as focus (derived state)
total_sites_per_group_CMH         Number of common sites across both timepoints
significant_sites_per_group_CMH   Differential significant sites (focus only)
score_CMH (%)                     Percentage: (differential sites / common sites) × 100
Taxonomy columns                  Domain, phylum, class, order, family, genus, species
grouped_by                        Grouping level

Understanding the scores:

  • Higher scores indicate more genomic positions showing evolutionary signatures

  • Standard tests: Share of analyzed sites that are significant, a genome-wide proxy for the strength of selection

  • CMH scores: Measure of timepoint-specific directional changes

Gene-Level Scores

Path: scores/processed/gene_scores_{timepoints}-{groups}/{test_type}_gene_scores.tsv.gz

Scores aggregated by gene.

Column                      Description
--------------------------  --------------------------------------
mag_id, gene_id             Identifiers
gene_parallelism_score      Mean parallelism across gene positions
gene_divergence_score       Mean divergence across gene positions
num_significant_positions   Count of significant sites in gene

Outlier Gene Files

Path: outlier_genes/{timepoints}-{groups}/{test_type}_outlier_genes.tsv.gz

Genes with exceptionally high scores (potential adaptive targets).

Column              Description
------------------  ------------------------------------
mag_id, gene_id     Identifiers
parallelism_score   Gene-level parallelism score
outlier_type        parallelism, divergence, or combined
z_score             Standard deviations from MAG mean
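
To make the z_score column concrete, the sketch below computes how many standard deviations each gene's score lies from the mean of its MAG. The helper `gene_z_scores` is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption; the source does not state which AlleleFlux uses.

```python
from statistics import mean, stdev


def gene_z_scores(gene_scores):
    """Map gene_id -> z-score relative to the mean score of one MAG."""
    values = list(gene_scores.values())
    # With fewer than two genes, or no spread, a z-score is undefined;
    # this sketch returns 0.0 in those cases.
    if len(values) < 2:
        return {gene: 0.0 for gene in gene_scores}
    mu, sd = mean(values), stdev(values)  # sample SD: an assumption
    if sd == 0:
        return {gene: 0.0 for gene in gene_scores}
    return {gene: (score - mu) / sd for gene, score in gene_scores.items()}


z = gene_z_scores({"gene_1": 1.0, "gene_2": 2.0, "gene_3": 3.0})
print(z)
```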

dN/dS Analysis Outputs

Generated by alleleflux-dnds-from-timepoints (see dN/dS Analysis Guide).

Codon Events: {mag}_codon_events_ng86.tsv.gz – Path-averaged S/NS counts per codon

Gene Summary: {mag}_gene_summary_ng86.tsv.gz – dN/dS ratios per gene

MAG Summary: {mag}_mag_summary_ng86.tsv.gz – Overall MAG dN/dS

Global Summary: {mag}_global_summary_ng86.tsv – Aggregate statistics

Key columns:

  • dN_dS: dN/dS ratio (>1 suggests positive selection, <1 suggests purifying selection)

  • potential_S, potential_N: Expected synonymous/non-synonymous sites

  • observed_S, observed_N: Fractional observed counts (path-averaged)

  • k: Number of positions changed in codon (1, 2, or 3)
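
As a hedged illustration of how these columns relate, the sketch below derives a dN/dS ratio from observed and potential counts in the style of the Nei-Gojobori (1986) method, which the ng86 file suffix appears to refer to. Whether AlleleFlux applies the Jukes-Cantor correction is not stated here, so the correction is exposed as an optional flag; the function names are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differences p."""
    if p >= 0.75:
        return float("nan")  # correction is undefined at or above 3/4
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(observed_N, potential_N, observed_S, potential_S, correct=True):
    """Return dN/dS, or NaN when the ratio is undefined."""
    if potential_N <= 0 or potential_S <= 0:
        return float("nan")
    p_n = observed_N / potential_N  # proportion of non-synonymous changes
    p_s = observed_S / potential_S  # proportion of synonymous changes
    d_n = jukes_cantor(p_n) if correct else p_n
    d_s = jukes_cantor(p_s) if correct else p_s
    if math.isnan(d_n) or math.isnan(d_s) or d_s == 0:
        return float("nan")
    return d_n / d_s


uncorrected = dn_ds(3.0, 300.0, 2.0, 100.0, correct=False)
corrected = dn_ds(3.0, 300.0, 2.0, 100.0)
print(uncorrected, corrected)
```

A gene with no observed synonymous changes yields NaN in this sketch, which is one common way to flag an undefined ratio.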

File Format Notes

  • Most files are gzip-compressed TSV (.tsv.gz)

  • Position numbering is 0-based

  • Missing values: NaN or empty string

  • p-values: [0, 1] range; significant sites typically p < 0.05

  • Allele frequencies: [0, 1] range (proportion of reads)
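
The conventions above can be handled with the standard library alone. This sketch reads a gzip-compressed TSV, maps the two documented missing-value spellings to None, and converts a 0-based position to the 1-based coordinate many genome browsers expect. The helper `read_tsv_gz` and the in-memory sample are hypothetical; in practice the function would be given a file path.

```python
import csv
import gzip
import io

MISSING = {"", "NaN"}  # the two missing-value spellings listed above


def read_tsv_gz(source):
    """Read a gzip-compressed TSV from a path or binary file object."""
    with gzip.open(source, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            {key: (None if value in MISSING else value)
             for key, value in row.items()}
            for row in reader
        ]


# Hypothetical content, compressed in memory to stand in for a real file.
payload = gzip.compress(b"contig\tposition\tgene_id\nc1\t0\t\nc1\t41\tgene_7\n")
rows = read_tsv_gz(io.BytesIO(payload))

# Positions are 0-based in the files; add 1 for 1-based tools.
one_based = [int(row["position"]) + 1 for row in rows]
print(rows[0]["gene_id"], one_based)
```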

See also: Interpreting Results, CLI Reference