Input Files Reference

AlleleFlux requires specific input file formats. This reference details all requirements.

Core Input Files

BAM Files

Sorted, indexed BAM files from metagenomic alignments to reference MAGs.

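| Requirement | Description |
| --- | --- |
| Format | Coordinate-sorted `.bam` with a `.bam.bai` index |
| Content | Reads aligned to MAG contigs |
| Quality | Higher mapping quality (MAPQ) gives more reliable allele calls |
| Sample ID | Derived from the metadata `bam_path` entry or from the BAM filename |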
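
The index requirement can be checked before a run. The following is a minimal sketch, not part of AlleleFlux: the function name `find_unindexed_bams` is hypothetical, and it assumes indices follow the `<name>.bam.bai` naming shown in the table. It only tests that the index file exists; it does not verify that the BAM is sorted.

```python
from pathlib import Path


def find_unindexed_bams(bam_dir):
    """Return the names of .bam files in bam_dir that have no .bam.bai index.

    Hypothetical helper: assumes the index sits next to the BAM file and is
    named <file>.bam.bai, as described in the table above.
    """
    missing = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        index = bam.with_name(bam.name + ".bai")  # sample.bam -> sample.bam.bai
        if not index.exists():
            missing.append(bam.name)
    return missing
```

Any file name returned here would need `samtools index` before profiling.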

Reference FASTA

Combined FASTA containing all MAG contigs.

| Property | Specification |
| --- | --- |
| Extension | `.fa` or `.fasta` |
| Header Format | `<MAG_ID>.fa_<contig_ID>` |
| Example | `>Bacteroides_001.fa_k141_1234` |

Create combined FASTA:

alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
                              --output-fasta combined.fasta --output-mapping mapping.tsv
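
To illustrate the header convention, the sketch below splits a header into its MAG ID and contig ID. It is an assumption-laden illustration rather than AlleleFlux's own parser: `split_contig_header` is a hypothetical name, and the split is made at the first `.fa_`, so it assumes MAG IDs do not themselves contain that substring.

```python
def split_contig_header(header):
    """Split '<MAG_ID>.fa_<contig_ID>' into (mag_id, contig_id).

    Hypothetical helper. Splits at the first '.fa_' occurrence, which assumes
    the MAG ID itself never contains '.fa_'.
    """
    parts = header.lstrip(">").split()
    name = parts[0] if parts else ""  # drop '>' and any trailing description
    mag_id, separator, contig_id = name.partition(".fa_")
    if not separator or not mag_id or not contig_id:
        raise ValueError(
            f"header does not match <MAG_ID>.fa_<contig_ID>: {header!r}"
        )
    return mag_id, contig_id
```

For the example header in the table, this returns `Bacteroides_001` as the MAG ID and `k141_1234` as the contig ID.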

Prodigal Genes FASTA

Gene predictions in nucleotide format.

| Property | Specification |
| --- | --- |
| Extension | `.fna` |
| Format | Prodigal nucleotide output (`-d` flag) |
| Header Example | `>Bacteroides_001.fa_k141_1234_1 # 100 # 450 # 1 # ID=1_1;...` |

Generate predictions:

# For combined FASTA (recommended)
prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
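
Prodigal headers carry the gene ID, start, end, and strand as `#`-separated fields, and the gene ID is the contig name plus a `_<n>` suffix. The sketch below shows one way to recover the parent contig so genes can be matched to reference contigs. The function name `parse_prodigal_header` is hypothetical, and this is not necessarily how AlleleFlux reads the file.

```python
def parse_prodigal_header(line):
    """Parse a Prodigal FASTA header into gene ID, contig, coordinates, strand.

    Hypothetical helper. Expects the layout shown in the table above:
    '>contig_N # start # end # strand # attributes'.
    """
    fields = [field.strip() for field in line.lstrip(">").split("#", 4)]
    if len(fields) < 4:
        raise ValueError(f"not a Prodigal header: {line!r}")
    gene_id = fields[0]
    # The gene ID is '<contig>_<gene number>'; strip the last '_<n>' suffix.
    contig, _, _gene_number = gene_id.rpartition("_")
    return {
        "gene_id": gene_id,
        "contig": contig,
        "start": int(fields[1]),
        "end": int(fields[2]),
        "strand": int(fields[3]),  # 1 = forward, -1 = reverse
    }
```

The `contig` value should equal a header in the combined reference FASTA; a mismatch suggests the genes were predicted from a different FASTA.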

Metadata File

Tab-separated file with sample information.

Required columns:

| Column | Description |
| --- | --- |
| `sample_id` | Unique sample identifier |
| `bam_path` | Absolute path to BAM file |
| `subjectID` | Biological replicate/subject ID (for pairing samples) |
| `group` | Experimental group (e.g., `treatment`, `control`) |
| `replicate` | Replicate identifier within group (for CMH stratification) |
| `time` | Timepoint (longitudinal only; e.g., `pre`, `post`) |

Example:

sample_id  bam_path           subjectID  group      replicate  time
S1         /data/sample1.bam  mouse1     control    A          pre
S2         /data/sample2.bam  mouse1     control    A          post
S3         /data/sample3.bam  mouse2     treatment  B          pre
S4         /data/sample4.bam  mouse2     treatment  B          post

Add BAM paths automatically:

alleleflux-add-bam-path --metadata meta.tsv --output meta_bam.tsv --bam-dir /data/bams/
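
The column requirements above can be checked with a short script before running the workflow. This is a minimal sketch under stated assumptions: `check_metadata` and its `longitudinal` flag are hypothetical names, not part of AlleleFlux, and "absolute" is taken to mean a POSIX path starting with `/`.

```python
import csv

REQUIRED_COLUMNS = ["sample_id", "bam_path", "subjectID", "group", "replicate"]


def check_metadata(handle, longitudinal=False):
    """Return a list of problems found in a tab-separated metadata file.

    Hypothetical helper. `handle` is any open text file or iterable of lines.
    The `time` column is only required when longitudinal=True.
    """
    required = REQUIRED_COLUMNS + (["time"] if longitudinal else [])
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in required if col not in header]
    if problems:
        return problems  # row checks are meaningless without the columns

    seen = set()
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        sample = row["sample_id"] or ""
        if sample in seen:
            problems.append(f"line {line_number}: duplicate sample_id {sample}")
        seen.add(sample)
        # Assumption: POSIX-style paths, so absolute means a leading '/'.
        if not (row["bam_path"] or "").startswith("/"):
            problems.append(f"line {line_number}: bam_path is not absolute")
    return problems
```

Calling `check_metadata(open("meta_bam.tsv"), longitudinal=True)` returns an empty list when the file passes these checks.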

MAG Mapping File

Maps contigs to MAG IDs.

| Column | Description |
| --- | --- |
| `contig_name` | Full contig name from reference FASTA |
| `mag_id` | MAG identifier |

Example:

contig_name                   mag_id
Bacteroides_001.fa_k141_1234  Bacteroides_001
Lachnospira_002.fa_k141_9012  Lachnospira_002

This file is generated automatically by alleleflux-create-mag-mapping.
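
If the mapping file was edited by hand or produced separately, it is worth confirming that every contig in the reference FASTA has a row. The sketch below is one way to do that; `unmapped_contigs` is a hypothetical helper, and it assumes the mapping file is tab-separated with the `contig_name` header shown above.

```python
import csv


def unmapped_contigs(fasta_lines, mapping_lines):
    """Return FASTA contig names that have no row in the MAG mapping file.

    Hypothetical helper. Both arguments are iterables of text lines, such as
    open file handles. Raises KeyError if the mapping lacks 'contig_name'.
    """
    reader = csv.DictReader(mapping_lines, delimiter="\t")
    mapped = {row["contig_name"] for row in reader}

    missing = []
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            contig = line[1:].split()[0]  # name up to the first whitespace
            if contig not in mapped:
                missing.append(contig)
    return missing
```

An empty result means the mapping covers all contigs in the FASTA.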

GTDB Taxonomy File

GTDB-Tk classification (optional, for annotation).

| Property | Specification |
| --- | --- |
| File | `gtdbtk.bac120.summary.tsv` |
| Required Columns | `user_genome` (MAG ID), `classification` (GTDB taxonomy) |

Generate with GTDB-Tk:

gtdbtk classify_wf --genome_dir mags/ --out_dir gtdbtk/ --cpus 16
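
When the taxonomy file is used, each MAG ID should appear in the `user_genome` column. The following sketch lists MAGs without a classification. The name `mags_without_taxonomy` is hypothetical, and the code assumes that `user_genome` values match MAG IDs exactly, as the table above states.

```python
import csv


def mags_without_taxonomy(summary_lines, mag_ids):
    """Return the MAG IDs that have no classification in a GTDB-Tk summary.

    Hypothetical helper. `summary_lines` is an iterable of text lines from the
    tab-separated summary file; `mag_ids` is any iterable of MAG identifiers.
    """
    reader = csv.DictReader(summary_lines, delimiter="\t")
    classified = {row["user_genome"]: row["classification"] for row in reader}
    # A MAG counts as missing if it is absent or its classification is empty.
    return sorted(mag for mag in mag_ids if not classified.get(mag))
```

The MAG IDs to check can be taken from the `mag_id` column of the MAG mapping file.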

Profile Files (Generated by Workflow)

Profile files are created by alleleflux-profile and serve as input for analysis steps.

Path: profiles/{sample}/{sample}_{mag}_profiled.tsv.gz

Columns:

| Column | Type | Description |
| --- | --- | --- |
| `contig` | str | Contig identifier |
| `position` | int | 0-based genomic position |
| `ref_base` | str | Reference base (A/C/G/T) |
| `total_coverage` | int | Total read depth |
| `A`, `C`, `G`, `T` | int | Base counts |
| `gene_id` | str | Overlapping gene (empty if intergenic) |
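
As an illustration of how these columns can be consumed, the sketch below reads a profile and converts base counts to per-site allele frequencies. It is not AlleleFlux's own code: `allele_frequencies` is a hypothetical name, the file is assumed to be gzip-compressed and tab-separated with a header row matching the table, and frequencies are computed against `total_coverage`.

```python
import csv
import gzip

BASES = ("A", "C", "G", "T")


def allele_frequencies(profile_path):
    """Yield (contig, position, {base: frequency}) for each covered site.

    Hypothetical helper. Assumes a gzip-compressed TSV with a header row that
    uses the column names listed above. Sites with zero coverage are skipped.
    """
    with gzip.open(profile_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            depth = int(row["total_coverage"])
            if depth == 0:
                continue  # avoid division by zero at uncovered positions
            freqs = {base: int(row[base]) / depth for base in BASES}
            yield row["contig"], int(row["position"]), freqs
```

A site with `total_coverage` 10 and counts `A=8`, `C=2` yields frequencies 0.8 for A and 0.2 for C.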

Pre-Run Checklist

Verify before running:

☑ BAM files sorted and indexed (.bai exists)
☑ Reference FASTA headers: <MAG_ID>.fa_<contig_ID>
☑ Prodigal genes match reference contigs
☑ Metadata has all required columns
☑ BAM paths are absolute paths
☑ Longitudinal: each subjectID has all timepoints
☑ At least min_sample_num samples per group
☑ GTDB file contains all MAG IDs (if used)
☑ MAG mapping covers all contigs
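
Two of the checklist items concern the study design rather than file formats: complete timepoints per subject and enough samples per group. The sketch below checks both from parsed metadata rows. `design_problems` is a hypothetical helper, and it counts metadata rows per group, which may differ from how AlleleFlux itself applies min_sample_num.

```python
from collections import Counter, defaultdict


def design_problems(rows, timepoints, min_sample_num):
    """Return design problems found in metadata rows.

    Hypothetical helper. `rows` is a list of dicts with 'subjectID', 'group',
    and 'time' keys; `timepoints` lists the expected time labels.
    Assumption: group size is the number of metadata rows in that group.
    """
    problems = []

    # Every subject should have a sample at every expected timepoint.
    seen = defaultdict(set)
    for row in rows:
        seen[row["subjectID"]].add(row["time"])
    for subject in sorted(seen):
        missing = sorted(set(timepoints) - seen[subject])
        if missing:
            problems.append(f"{subject}: missing timepoint(s) {', '.join(missing)}")

    # Every group should reach the minimum sample count.
    group_sizes = Counter(row["group"] for row in rows)
    for group in sorted(group_sizes):
        if group_sizes[group] < min_sample_num:
            problems.append(
                f"{group}: {group_sizes[group]} sample(s), need {min_sample_num}"
            )
    return problems
```

The rows can come from `csv.DictReader` applied to the metadata file with `delimiter="\t"`.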

Common Issues

Missing BAM indices

# Index all BAM files in the current directory (quoting handles spaces in names)
for bam in *.bam; do samtools index "$bam"; done

Incorrect header format

# Show the first FASTA headers
grep "^>" combined.fasta | head
# Each header should match: ><MAG_ID>.fa_<contig_ID>

Metadata path issues

Use absolute paths for bam_path:

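# Convert relative bam_path values to absolute paths, leaving the header and other columns untouched.
# Assumes bam_path is the second column, as in the example metadata; metadata_abs.tsv is an illustrative output name.
awk -v dir="$(pwd)" 'BEGIN { FS = OFS = "\t" } NR > 1 && NF >= 2 && $2 !~ /^\// { $2 = dir "/" $2 } { print }' metadata.tsv > metadata_abs.tsv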