Input Files Reference¶
AlleleFlux requires specific input file formats. This reference details all requirements.
Core Input Files¶
BAM Files¶
Sorted, indexed BAM files from metagenomic alignments to reference MAGs.
| Requirement | Description |
|---|---|
| Format | Sorted |
| Content | Reads aligned to MAG contigs |
| Quality | Higher MAPQ → more reliable alleles |
| Sample ID | From metadata |
Reference FASTA¶
Combined FASTA containing all MAG contigs.
| Property | Specification |
|---|---|
| Extension | |
| Header Format | <MAG_ID>.fa_<contig_ID> |
| Example | |
Create combined FASTA:
alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
--output-fasta combined.fasta --output-mapping mapping.tsv
Prodigal Genes FASTA¶
Gene predictions in nucleotide format.
| Property | Specification |
|---|---|
| Extension | |
| Format | Prodigal nucleotide output ( |
| Header Example | |
Generate predictions:
# For combined FASTA (recommended)
prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
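To confirm that gene predictions line up with the reference contigs, a check along the following lines may help. It is a minimal sketch, not part of AlleleFlux: the function name `check_gene_contigs` is hypothetical, and it assumes Prodigal's usual header layout, in which each gene ID is the contig name followed by `_<gene number>`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): print gene IDs whose parent
# contig is absent from the reference FASTA.
# Assumption: Prodigal headers look like ">{contig}_{gene_number} # start # end ..."
check_gene_contigs() {
    local ref_fasta=$1 genes_fasta=$2
    awk '
        # First file: collect contig names (first word of each header, minus ">").
        FNR == NR { if (/^>/) { name = substr($1, 2); contigs[name] = 1 } next }
        # Second file: strip the trailing _N gene index and look up the contig.
        /^>/ {
            gene = substr($1, 2)
            contig = gene
            sub(/_[0-9]+$/, "", contig)
            if (!(contig in contigs)) print gene
        }
    ' "$ref_fasta" "$genes_fasta"
}
```

Calling `check_gene_contigs combined.fasta genes.fna` prints nothing when every predicted gene maps back to a reference contig.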
Metadata File¶
Tab-separated file with sample information.
Required columns:
| Column | Description |
|---|---|
| sample_id | Unique sample identifier |
| bam_path | Absolute path to BAM file |
| subjectID | Biological replicate/subject ID (for pairing samples) |
| group | Experimental group (e.g., |
| replicate | Replicate identifier within group (for CMH stratification) |
| time | Timepoint (longitudinal only; e.g., |
Example:
sample_id bam_path subjectID group replicate time
S1 /data/sample1.bam mouse1 control A pre
S2 /data/sample2.bam mouse1 control A post
S3 /data/sample3.bam mouse2 treatment B pre
S4 /data/sample4.bam mouse2 treatment B post
Add BAM paths automatically:
alleleflux-add-bam-path --metadata meta.tsv --output meta_bam.tsv --bam-dir /data/bams/
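Before starting a run, the metadata file can be sanity-checked with a small script. The sketch below is an illustration rather than an AlleleFlux feature: `check_metadata` is a hypothetical name, the column names are taken from the example header above, and `time` is not enforced because it applies to longitudinal designs only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): report missing required
# columns and non-absolute bam_path values in a tab-separated metadata file.
# Exits non-zero if any problem is found.
check_metadata() {
    local metadata=$1
    awk -F'\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            # Map column names to their positions.
            for (i = 1; i <= NF; i++) col[$i] = i
            n = split("sample_id bam_path subjectID group replicate", req, " ")
            for (j = 1; j <= n; j++)
                if (!(req[j] in col)) { print "missing column: " req[j]; bad = 1 }
            next
        }
        # Data rows: flag bam_path values that do not start with "/".
        ("bam_path" in col) && $(col["bam_path"]) !~ /^\// {
            print "relative bam_path on line " NR ": " $(col["bam_path"])
            bad = 1
        }
        END { exit bad }
    ' "$metadata"
}
```

For example, `check_metadata meta_bam.tsv && echo "metadata OK"` prints the confirmation only when no problems are reported.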
MAG Mapping File¶
Maps contigs to MAG IDs.
| Column | Description |
|---|---|
| contig_name | Full contig name from reference FASTA |
| mag_id | MAG identifier |
Example:
contig_name mag_id
Bacteroides_001.fa_k141_1234 Bacteroides_001
Lachnospira_002.fa_k141_9012 Lachnospira_002
This file is generated automatically by alleleflux-create-mag-mapping.
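If the mapping file was edited by hand or produced separately from the combined FASTA, it may be worth checking that every contig has an entry. The following is a rough sketch with a hypothetical function name, `check_mapping_coverage`; it assumes the mapping file has a header row and that the contig name is its first column, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): print FASTA contigs that
# have no row in the MAG mapping file.
# Assumptions: tab-separated mapping with a header row; contig name in column 1.
check_mapping_coverage() {
    local mapping=$1 ref_fasta=$2
    awk -F'\t' '
        # First file (mapping): remember every contig name, skipping the header.
        FNR == NR { if (FNR > 1) mapped[$1] = 1; next }
        # Second file (FASTA): print header names with no mapping entry.
        /^>/ {
            split(substr($0, 2), parts, /[ \t]/)
            if (!(parts[1] in mapped)) print parts[1]
        }
    ' "$mapping" "$ref_fasta"
}
```

Running `check_mapping_coverage mapping.tsv combined.fasta` prints one unmapped contig per line, or nothing if coverage is complete.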
GTDB Taxonomy File¶
GTDB-Tk classification (optional, for annotation).
| Property | Specification |
|---|---|
| File | |
| Required Columns | |
Generate with GTDB-Tk:
gtdbtk classify_wf --genome_dir mags/ --out_dir gtdbtk/ --cpus 16
Profile Files (Generated by Workflow)¶
Profile files are created by alleleflux-profile and serve as input for analysis steps.
Path: profiles/{sample}/{sample}_{mag}_profiled.tsv.gz
Columns:
| Column | Type | Description |
|---|---|---|
| | str | Contig identifier |
| | int | 0-based genomic position |
| | str | Reference base (A/C/G/T) |
| | int | Total read depth |
| | int | Base counts |
| | str | Overlapping gene (empty if intergenic) |
Pre-Run Checklist¶
Verify before running:
☑ BAM files sorted and indexed (.bai exists)
☑ Reference FASTA headers: <MAG_ID>.fa_<contig_ID>
☑ Prodigal genes match reference contigs
☑ Metadata has all required columns
☑ BAM paths are absolute paths
☑ Longitudinal: each subjectID has all timepoints
☑ ≥ min_sample_num samples per group
☑ GTDB file contains all MAG IDs (if used)
☑ MAG mapping covers all contigs
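The longitudinal item in this checklist can be checked mechanically. The sketch below uses a hypothetical function name, `check_timepoints`, and makes one explicit assumption: "all timepoints" means every distinct value found in the `time` column of the metadata file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): print each subjectID that
# is missing at least one timepoint.
# Assumption: the full set of timepoints is every distinct value in "time".
check_timepoints() {
    local metadata=$1
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            s = $(col["subjectID"]); t = $(col["time"])
            # Count each (subject, timepoint) pair once.
            if (!((s, t) in seen)) { seen[s, t] = 1; count[s]++ }
            times[t] = 1
        }
        END {
            total = 0
            for (t in times) total++
            for (s in count)
                if (count[s] < total) print s
        }
    ' "$metadata" | sort
}
```

`check_timepoints meta_bam.tsv` prints nothing when every subject has a sample at every timepoint.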
Common Issues¶
Missing BAM indices
# Index all BAM files
for bam in *.bam; do samtools index "$bam"; done
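To see which BAM files still lack an index before re-indexing everything, something like the following could be used. It is a sketch with a hypothetical function name, `list_unindexed_bams`; it treats either `sample.bam.bai` or `sample.bai` as a valid index name, which is an assumption about naming rather than a documented AlleleFlux rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): print BAM files in a
# directory that have no .bai index next to them.
list_unindexed_bams() {
    local bam_dir=$1 bam
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue   # the glob matched nothing
        # Accept sample.bam.bai (samtools default) or sample.bai (assumed alternative).
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}
```

The output of `list_unindexed_bams /data/bams` can then be passed to `samtools index` one file at a time.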
Incorrect header format
# Check FASTA headers
grep ">" combined.fasta | head
# Should match: >MAG_ID.fa_contigID
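For large references, scanning the first few headers by eye may miss problems further down the file. The sketch below lists only the headers that do not fit the expected pattern. The function name `list_bad_headers` is hypothetical, and the pattern assumes the literal `.fa_` separator shown in this reference; a different MAG file extension would need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an AlleleFlux command): print FASTA header names
# that do not look like <MAG_ID>.fa_<contig_ID>.
list_bad_headers() {
    local ref_fasta=$1
    # Keep header lines, drop ">" and any description after whitespace, then
    # keep names lacking a non-empty MAG ID, ".fa_", and a non-empty contig ID.
    grep '^>' "$ref_fasta" \
        | sed -e 's/^>//' -e 's/[[:space:]].*$//' \
        | grep -v -E '^.+\.fa_.+$' || true
}
```

`list_bad_headers combined.fasta` prints nothing when all headers conform.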
Metadata path issues
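Use absolute paths for bam_path. Only the bam_path column should be rewritten, not the start of every line; the command below assumes bam_path is the second column, as in the example metadata:
# Convert relative bam_path values (column 2) to absolute paths;
# the header row and already-absolute paths are left unchanged
awk -F'\t' -v OFS='\t' -v dir="$(pwd)" 'NR > 1 && $2 !~ /^\// { $2 = dir "/" $2 } 1' metadata.tsv > metadata_abs.tsv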
See also: Input Preparation, Configuration Reference