Input Files Reference

AlleleFlux requires specific input file formats. This reference details all requirements.

Core Input Files

BAM Files

Sorted, indexed BAM files from metagenomic alignments to reference MAGs.

Requirement  Description
Format       Sorted .bam with .bam.bai index
Content      Reads aligned to MAG contigs
Quality      Higher MAPQ → more reliable alleles
Sample ID    From metadata bam_path or filename

Reference FASTA

Combined FASTA containing all MAG contigs.

Property       Specification
Extension      .fa or .fasta
Header Format  <MAG_ID>.fa_<contig_ID>
Example        >Bacteroides_001.fa_k141_1234

Create combined FASTA:

alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
                              --output-fasta combined.fasta --output-mapping mapping.tsv
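To illustrate the <MAG_ID>.fa_<contig_ID> header convention, here is a minimal Python sketch that splits a header into its two parts. The function name split_contig_header is hypothetical and not part of AlleleFlux. The sketch assumes the first occurrence of ".fa_" marks the boundary, so a MAG ID that itself contains ".fa_" would be split incorrectly.

```python
from __future__ import annotations


def split_contig_header(header: str, separator: str = ".fa_") -> tuple[str, str]:
    """Split '<MAG_ID>.fa_<contig_ID>' into (MAG_ID, contig_ID)."""
    parts = header.lstrip(">").split()
    name = parts[0] if parts else ""  # ignore any description after whitespace
    # Assumption: the first separator is the MAG/contig boundary.
    mag_id, sep, contig_id = name.partition(separator)
    if not (sep and mag_id and contig_id):
        raise ValueError(
            f"header does not match <MAG_ID>.fa_<contig_ID>: {header!r}"
        )
    return mag_id, contig_id


print(split_contig_header(">Bacteroides_001.fa_k141_1234"))
# ('Bacteroides_001', 'k141_1234')
```

Running this over every header line of combined.fasta is a quick way to catch contigs that were added without the MAG prefix.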

Prodigal Genes FASTA

Gene predictions in nucleotide format.

Property        Specification
Extension       .fna
Format          Prodigal nucleotide output (-d flag)
Header Example  >Bacteroides_001.fa_k141_1234_1 # 100 # 450 # 1 # ID=1_1;...

Generate predictions:

# For combined FASTA (recommended)
prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
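The header example above follows Prodigal's usual layout: gene ID, start, end, and strand separated by "#", followed by a semicolon-delimited attribute string. The sketch below shows one way to pull those fields apart and recover the parent contig by dropping the trailing gene index. The names ProdigalGene and parse_prodigal_header are hypothetical, and this is not necessarily how AlleleFlux parses the file.

```python
from typing import NamedTuple


class ProdigalGene(NamedTuple):
    gene_id: str   # e.g. Bacteroides_001.fa_k141_1234_1
    contig: str    # gene_id without the trailing _<gene index>
    start: int     # start coordinate as written by Prodigal
    end: int       # end coordinate as written by Prodigal
    strand: int    # 1 for forward, -1 for reverse


def parse_prodigal_header(line: str) -> ProdigalGene:
    """Parse a Prodigal FASTA header of the form 'id # start # end # strand # attrs'."""
    fields = [field.strip() for field in line.lstrip(">").split("#", 4)]
    if len(fields) < 4:
        raise ValueError(f"not a Prodigal header: {line!r}")
    gene_id = fields[0]
    # The contig name is everything before the last underscore.
    contig, sep, _gene_index = gene_id.rpartition("_")
    if not sep:
        raise ValueError(f"gene ID has no _<index> suffix: {gene_id!r}")
    return ProdigalGene(gene_id, contig, int(fields[1]), int(fields[2]), int(fields[3]))


gene = parse_prodigal_header(
    ">Bacteroides_001.fa_k141_1234_1 # 100 # 450 # 1 # ID=1_1;..."
)
print(gene.contig, gene.start, gene.end, gene.strand)
```

The recovered contig name should match a header in the reference FASTA, which is what the "Prodigal genes match reference contigs" check relies on.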

Metadata File

Tab-separated file with sample information.

Required columns:

Column     Description
sample_id  Unique sample identifier
bam_path   Absolute path to BAM file
subjectID  Biological replicate/subject ID (for pairing samples)
group      Experimental group (e.g., treatment, control)
replicate  Replicate identifier within group (for CMH stratification)
time       Timepoint (longitudinal only; e.g., pre, post)

Example:

sample_id  bam_path           subjectID  group      replicate  time
S1         /data/sample1.bam  mouse1     control    A          pre
S2         /data/sample2.bam  mouse1     control    A          post
S3         /data/sample3.bam  mouse2     treatment  B          pre
S4         /data/sample4.bam  mouse2     treatment  B          post

Add BAM paths automatically:

alleleflux-add-bam-path --metadata meta.tsv --output meta_bam.tsv --bam-dir /data/bams/
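A metadata file can be sanity-checked before a run. The following is a minimal sketch using only the Python standard library; check_metadata is a hypothetical helper, not an AlleleFlux command. It assumes a tab-separated file with a header row and POSIX-style paths, and treats time as required only when longitudinal=True.

```python
from __future__ import annotations

import csv
import io
import posixpath

REQUIRED_COLUMNS = ("sample_id", "bam_path", "subjectID", "group", "replicate")


def check_metadata(handle, longitudinal: bool = False) -> list[str]:
    """Return a list of human-readable problems found in a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = REQUIRED_COLUMNS + (("time",) if longitudinal else ())
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = row["sample_id"]
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample_id {sample_id!r}")
        seen_ids.add(sample_id)
        bam_path = row["bam_path"] or ""
        if not posixpath.isabs(bam_path):
            problems.append(f"line {line_no}: bam_path is not absolute: {bam_path!r}")
    return problems


example = (
    "sample_id\tbam_path\tsubjectID\tgroup\treplicate\ttime\n"
    "S1\t/data/sample1.bam\tmouse1\tcontrol\tA\tpre\n"
    "S2\tsample2.bam\tmouse1\tcontrol\tA\tpost\n"
)
for problem in check_metadata(io.StringIO(example), longitudinal=True):
    print(problem)
```

For a real file, pass an open handle instead, for example open("meta_bam.tsv", newline="").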

MAG Mapping File

Maps contigs to MAG IDs.

Column       Description
contig_name  Full contig name from reference FASTA
mag_id       MAG identifier

Example:

contig_name                   mag_id
Bacteroides_001.fa_k141_1234  Bacteroides_001
Lachnospira_002.fa_k141_9012  Lachnospira_002

This file is generated automatically by alleleflux-create-mag-mapping.
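If the mapping file is edited by hand or the reference FASTA changes after it was generated, the two can drift apart. The sketch below lists contigs that appear in the FASTA but not in the mapping. The helper name unmapped_contigs is hypothetical, and the mapping is assumed to be tab-separated with the contig_name and mag_id header shown above.

```python
from __future__ import annotations

import csv


def unmapped_contigs(fasta_lines, mapping_lines) -> list[str]:
    """Return FASTA contig names that have no row in the MAG mapping."""
    # A contig name is the first whitespace-delimited token after '>'.
    contigs = [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]
    reader = csv.DictReader(mapping_lines, delimiter="\t")
    mapped = {row["contig_name"] for row in reader}
    return [contig for contig in contigs if contig not in mapped]


fasta = [
    ">Bacteroides_001.fa_k141_1234\n",
    "ACGTACGT\n",
    ">Lachnospira_002.fa_k141_9012\n",
    "GGCCTTAA\n",
]
mapping = [
    "contig_name\tmag_id\n",
    "Bacteroides_001.fa_k141_1234\tBacteroides_001\n",
]
print(unmapped_contigs(fasta, mapping))
# ['Lachnospira_002.fa_k141_9012']
```

Both arguments can be open file handles, since the function only iterates over lines.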

GTDB Taxonomy File

GTDB-Tk classification (optional, for annotation).

Property          Specification
File              gtdbtk.bac120.summary.tsv
Required Columns  user_genome (MAG ID), classification (GTDB taxonomy)

Generate with GTDB-Tk:

gtdbtk classify_wf --genome_dir mags/ --out_dir gtdbtk/ --cpus 16
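The classification column holds a GTDB lineage string with rank prefixes such as d__ and p__, separated by semicolons. As a rough sketch of how such a string can be broken into named ranks (parse_gtdb_classification is a hypothetical helper, and the lineage below is an illustrative value rather than output from a real run):

```python
from __future__ import annotations

# GTDB rank prefixes and the rank names they stand for.
GTDB_RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_classification(classification: str) -> dict[str, str]:
    """Turn 'd__Bacteria;p__...;s__' into {'domain': 'Bacteria', ...}."""
    ranks = {}
    for token in classification.split(";"):
        prefix, sep, name = token.strip().partition("__")
        if sep and prefix in GTDB_RANKS:
            # Unresolved ranks (e.g. a bare 's__') are kept as empty strings.
            ranks[GTDB_RANKS[prefix]] = name
    return ranks


lineage = (
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
    "f__Bacteroidaceae;g__Bacteroides;s__"
)
taxonomy = parse_gtdb_classification(lineage)
print(taxonomy["genus"], repr(taxonomy["species"]))
```

Strings without rank prefixes, such as a placeholder for an unclassified genome, simply yield an empty dictionary in this sketch.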

Profile Files (Generated by Workflow)

Profile files are created by alleleflux-profile and serve as input for analysis steps.

Path: profiles/{sample}/{sample}_{mag}_profiled.tsv.gz

Columns:

Column          Type  Description
contig          str   Contig identifier
position        int   0-based genomic position
ref_base        str   Reference base (A/C/G/T)
total_coverage  int   Total read depth
A, C, G, T      int   Base counts
gene_id         str   Overlapping gene (empty if intergenic)
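As an illustration of how the columns above fit together, the sketch below reads a profile file and converts the four base counts at a position into allele frequencies. It assumes the file is gzip-compressed TSV with a header row and that A, C, G, and T are four separate columns. The function names are hypothetical, and dividing by the sum of the four counts (rather than total_coverage) is a choice made for this example, not necessarily what AlleleFlux does internally.

```python
from __future__ import annotations

import csv
import gzip

BASES = ("A", "C", "G", "T")


def read_profile(path):
    """Yield one dict per position from a gzip-compressed profile TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def allele_frequencies(row) -> dict[str, float]:
    """Convert the A/C/G/T counts of one profile row into frequencies."""
    counts = {base: int(row[base]) for base in BASES}
    depth = sum(counts.values())
    if depth == 0:
        # No reads support any base at this position.
        return {base: 0.0 for base in BASES}
    return {base: count / depth for base, count in counts.items()}


# A single made-up row, shaped like csv.DictReader output (all values are strings).
row = {"contig": "Bacteroides_001.fa_k141_1234", "position": "41",
       "ref_base": "A", "total_coverage": "10",
       "A": "6", "C": "2", "G": "0", "T": "2", "gene_id": ""}
print(allele_frequencies(row))
```

With a real file, rows would come from read_profile(path), where path follows the profiles/{sample}/{sample}_{mag}_profiled.tsv.gz pattern.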

Pre-Run Checklist

Verify before running:

  • ☑ BAM files sorted and indexed (.bai exists)

  • ☑ Reference FASTA headers: <MAG_ID>.fa_<contig_ID>

  • ☑ Prodigal genes match reference contigs

  • ☑ Metadata has all required columns

  • ☑ BAM paths are absolute paths

  • ☑ Longitudinal: each subjectID has all timepoints

  • ☑ At least min_sample_num samples per group

  • ☑ GTDB file contains all MAG IDs (if used)

  • ☑ MAG mapping covers all contigs
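Two of the checklist items, complete timepoints per subjectID and enough samples per group, can be checked directly from the metadata rows. The sketch below is one possible way to do so. check_design is a hypothetical helper, and counting metadata rows per group is an assumption about how min_sample_num is applied; consult the Configuration Reference for the exact rule.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def check_design(rows, min_sample_num: int, timepoints=None) -> list[str]:
    """Check group sizes and, if timepoints are given, per-subject completeness.

    rows is a list of dicts keyed by the metadata column names.
    """
    problems = []
    group_sizes = Counter(row["group"] for row in rows)
    for group, size in sorted(group_sizes.items()):
        if size < min_sample_num:
            problems.append(
                f"group {group!r} has {size} samples, "
                f"fewer than min_sample_num={min_sample_num}"
            )
    if timepoints:
        seen = defaultdict(set)
        for row in rows:
            seen[row["subjectID"]].add(row["time"])
        for subject, times in sorted(seen.items()):
            missing = [t for t in timepoints if t not in times]
            if missing:
                problems.append(f"subject {subject!r} is missing timepoints: {missing}")
    return problems


rows = [
    {"sample_id": "S1", "subjectID": "mouse1", "group": "control", "time": "pre"},
    {"sample_id": "S2", "subjectID": "mouse1", "group": "control", "time": "post"},
    {"sample_id": "S3", "subjectID": "mouse2", "group": "treatment", "time": "pre"},
]
for problem in check_design(rows, min_sample_num=2, timepoints=["pre", "post"]):
    print(problem)
```

The rows can be produced with csv.DictReader(handle, delimiter="\t") and wrapped in list() so they can be iterated more than once.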

Common Issues

Missing BAM indices

# Index all BAM files
for bam in *.bam; do samtools index "$bam"; done
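To find out which BAM files still lack an index before re-indexing everything, a small check like the following can help. It is a sketch that assumes the <name>.bam.bai naming described under BAM Files. bams_missing_index is a hypothetical helper, and the exists parameter is there only so the check can be exercised without touching the filesystem.

```python
from __future__ import annotations

import os


def bams_missing_index(bam_paths, exists=os.path.exists) -> list[str]:
    """Return the BAM paths whose '<path>.bai' index file is not present."""
    return [path for path in bam_paths if not exists(f"{path}.bai")]


# Demonstration with a stand-in for os.path.exists: only sample1 has an index.
indexed = {"/data/sample1.bam.bai"}
missing = bams_missing_index(
    ["/data/sample1.bam", "/data/sample2.bam"],
    exists=lambda path: path in indexed,
)
print(missing)
# ['/data/sample2.bam']
```

In practice the list of paths would come from the bam_path column of the metadata file, with the default exists left in place.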

Incorrect header format

# Check FASTA headers
grep ">" combined.fasta | head
# Should match: >MAG_ID.fa_contigID

Metadata path issues

Use absolute paths for bam_path:

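# Convert relative bam_path values to absolute (assumes bam_path is column 2; the header row is left unchanged)
awk -v d="$(pwd)" 'BEGIN { FS = OFS = "\t" } NR > 1 && $2 !~ /^\// { $2 = d "/" $2 } 1' metadata.tsv > metadata_abs.tsv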

See also: Input Preparation, Configuration Reference