Input Preparation

AlleleFlux requires several input files. This guide describes each file and how to prepare and format it.

Required Files

File	Format	Description
BAM files	`.bam` + `.bai`	Sorted and indexed alignments of metagenomic reads to MAGs
Reference FASTA	`.fa` / `.fasta`	Combined MAG contigs. Header format: `<MAG_ID>.fa_<contig_ID>`
Prodigal genes	`.fna`	Nucleotide ORF predictions matching reference contig IDs
Metadata TSV	`.tsv`	Sample information with columns `sample_id`, `bam_path`, `subjectID`, `group`, and `replicate`. Longitudinal studies also need a `time` column
MAG mapping	`.tsv`	Contig → MAG assignments (`contig_name`, `mag_id`)
GTDB taxonomy	`.tsv` (optional)	`gtdbtk.bac120.summary.tsv` for taxonomic aggregation
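
The header convention `<MAG_ID>.fa_<contig_ID>` is easy to get wrong when combining MAGs by hand. The following is a minimal sketch of a check, not part of AlleleFlux: the function name `split_header` is hypothetical, and it assumes the first occurrence of `.fa_` separates the MAG ID from the contig ID.

```python
def split_header(header: str, marker: str = ".fa_") -> tuple:
    """Split '<MAG_ID>.fa_<contig_ID>' into (mag_id, contig_id).

    Assumption: the first '.fa_' in the name is the separator, so a MAG ID
    that itself contains '.fa_' would be split in the wrong place.
    """
    fields = header.lstrip(">").split()
    name = fields[0] if fields else ""  # ignore any description after the ID
    mag_id, sep, contig_id = name.partition(marker)
    if not sep or not mag_id or not contig_id:
        raise ValueError(f"header does not match <MAG_ID>.fa_<contig_ID>: {header!r}")
    return mag_id, contig_id


print(split_header(">MAG001.fa_k141_12"))  # prints ('MAG001', 'k141_12')
```

Running this over every header line of the reference FASTA before starting the pipeline would surface malformed names early.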

Metadata File Format

Longitudinal Study:

sample_id  bam_path                  subjectID  group      replicate  time
S1         /data/S1.sorted.bam       mouse1     control    A          pre
S2         /data/S2.sorted.bam       mouse2     control    B          pre
S3         /data/S3.sorted.bam       mouse3     treatment  A          pre
S4         /data/S4.sorted.bam       mouse4     treatment  B          pre
S5         /data/S5.sorted.bam       mouse1     control    A          post
S6         /data/S6.sorted.bam       mouse2     control    B          post
S7         /data/S7.sorted.bam       mouse3     treatment  A          post
S8         /data/S8.sorted.bam       mouse4     treatment  B          post

Single Timepoint:

sample_id  bam_path                  subjectID  group    replicate
S1         /data/S1.sorted.bam       subject1   disease  A
S2         /data/S2.sorted.bam       subject2   disease  B
S3         /data/S3.sorted.bam       subject3   healthy  A
S4         /data/S4.sorted.bam       subject4   healthy  B
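
A quick way to confirm that a metadata file has the columns listed above is to read it with Python's `csv` module. This is a sketch under stated assumptions, not AlleleFlux's own validation: `check_metadata` is a hypothetical helper, and the uniqueness check on `sample_id` is a sanity check added here rather than a documented requirement.

```python
import csv
import io

REQUIRED = ["sample_id", "bam_path", "subjectID", "group", "replicate"]


def check_metadata(handle, longitudinal: bool) -> list:
    """Read a tab-separated metadata table and verify the expected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    required = REQUIRED + (["time"] if longitudinal else [])
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata is missing columns: {missing}")
    rows = list(reader)
    # Assumption: each sample should appear once; flag repeated IDs.
    ids = [row["sample_id"] for row in rows]
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    if dupes:
        raise ValueError(f"duplicate sample_id values: {dupes}")
    return rows


sample = (
    "sample_id\tbam_path\tsubjectID\tgroup\treplicate\ttime\n"
    "S1\t/data/S1.sorted.bam\tmouse1\tcontrol\tA\tpre\n"
    "S5\t/data/S5.sorted.bam\tmouse1\tcontrol\tA\tpost\n"
)
rows = check_metadata(io.StringIO(sample), longitudinal=True)
print(len(rows))  # prints 2
```

For a file on disk, pass `open("metadata.tsv", newline="")` instead of the in-memory string. Note that the file must be separated by real tab characters; the aligned examples above are shown with spaces only for readability.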

Minimal Configuration

Create `config.yml` with paths to your files:

data_type: "longitudinal"  # or "single"

input:
  fasta_path: "reference.fa"
  prodigal_path: "genes.fna"
  metadata_path: "metadata.tsv"
  mag_mapping_path: "mag_mapping.tsv"
  gtdb_path: "gtdbtk.tsv"  # optional

output:
  root_dir: "output/"

analysis:
  timepoints_combinations:
    - timepoint: ["pre", "post"]
      focus: "post"
  groups_combinations:
    - ["treatment", "control"]
  use_lmm: true
  use_significance_tests: true
  use_cmh: true

See Configuration Reference for all options.
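
The labels under `timepoints_combinations` and `groups_combinations` have to match the values in the metadata `time` and `group` columns. The sketch below cross-checks them on plain Python data, as if the `analysis` block had already been loaded from YAML. `check_combinations` is a hypothetical helper, and the rule that `focus` must be one of the listed timepoints is an assumption drawn from the example above, not a documented constraint.

```python
def check_combinations(analysis: dict, rows: list) -> list:
    """Return human-readable problems; an empty list means the labels line up."""
    times = {row["time"] for row in rows if row.get("time")}
    groups = {row["group"] for row in rows}
    problems = []
    for combo in analysis.get("timepoints_combinations", []):
        for label in combo["timepoint"]:
            if label not in times:
                problems.append(f"timepoint {label!r} not found in metadata")
        focus = combo.get("focus")
        # Assumption: focus should name one of the timepoints in the pair.
        if focus is not None and focus not in combo["timepoint"]:
            problems.append(f"focus {focus!r} is not one of {combo['timepoint']}")
    for combo in analysis.get("groups_combinations", []):
        for label in combo:
            if label not in groups:
                problems.append(f"group {label!r} not found in metadata")
    return problems


analysis = {
    "timepoints_combinations": [{"timepoint": ["pre", "post"], "focus": "post"}],
    "groups_combinations": [["treatment", "control"]],
}
rows = [
    {"group": "control", "time": "pre"},
    {"group": "treatment", "time": "post"},
]
print(check_combinations(analysis, rows))  # prints []
```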

Preparation Utilities

Create the MAG mapping and a combined reference FASTA from individual MAG FASTA files:

alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
    --output-fasta combined.fasta --output-mapping mapping.tsv

Add BAM paths to existing metadata:

alleleflux-add-bam-path --metadata metadata.tsv \
    --bam-dir bamfiles/ --output metadata_with_bam.tsv
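
Because the BAM files must be indexed, it can help to confirm that every path in `bam_path` has an index next to it. The following is a sketch only: `find_index` and `missing_inputs` are hypothetical helpers, and the code assumes the index sits beside the BAM under one of the two common names, `sample.bam.bai` or `sample.bai`.

```python
from pathlib import Path
from typing import Optional


def find_index(bam: Path) -> Optional[Path]:
    """Return the .bai index next to a BAM file, or None if there is none."""
    candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def missing_inputs(bam_paths) -> list:
    """List BAM files that are absent or lack an index."""
    problems = []
    for bam in map(Path, bam_paths):
        if not bam.is_file():
            problems.append(f"{bam}: BAM not found")
        elif find_index(bam) is None:
            problems.append(f"{bam}: no .bai index found")
    return problems
```

Passing the `bam_path` column of the metadata table to `missing_inputs` gives a list of files to fix. A missing index can be created with `samtools index`.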

Generate Prodigal predictions:

prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
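
Prodigal names each predicted gene by appending `_<n>` to the contig ID, so the gene headers in `genes.fna` can be checked against the reference contig IDs. The sketch below illustrates that check. `contig_of` and `orphan_contigs` are hypothetical helpers, not AlleleFlux functions.

```python
def contig_of(gene_header: str) -> str:
    """Recover the contig ID from a Prodigal gene header.

    Prodigal writes headers such as
    '>MAG001.fa_k141_12_3 # 2 # 181 # 1 # ID=1_3;...',
    where the trailing '_3' is the gene's index on the contig.
    """
    fields = gene_header.lstrip(">").split()
    gene_id = fields[0] if fields else ""
    contig, sep, index = gene_id.rpartition("_")
    if not sep or not contig or not index.isdigit():
        raise ValueError(f"unexpected gene header: {gene_header!r}")
    return contig


def orphan_contigs(gene_headers, reference_contigs) -> list:
    """Contigs named in the gene file that are absent from the reference."""
    reference = set(reference_contigs)
    return sorted({contig_of(h) for h in gene_headers} - reference)


headers = [">MAG001.fa_k141_12_1 # 2 # 181 # 1", ">MAG002.fa_k141_7_1 # 5 # 95 # -1"]
print(orphan_contigs(headers, ["MAG001.fa_k141_12"]))  # prints ['MAG002.fa_k141_7']
```

An empty result means every gene maps to a contig in the reference. A non-empty one usually means the genes were predicted from a different FASTA than the one given as `fasta_path`.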

For detailed options, run `alleleflux-create-mag-mapping --help`.

Next Steps

Once inputs are prepared:

Create configuration file: Configuration Reference
Run the pipeline: Running the Workflow
Examine example data: Example Data