# Input Files Reference

AlleleFlux requires specific input file formats. This reference details all requirements.

## Core Input Files

### BAM Files

Sorted, indexed BAM files from metagenomic alignments to reference MAGs.

| Requirement | Description |
|-------------|-------------|
| Format | Sorted `.bam` with `.bam.bai` index |
| Content | Reads aligned to MAG contigs |
| Quality | Higher MAPQ → more reliable alleles |
| Sample ID | From metadata `bam_path` or filename |

### Reference FASTA

Combined FASTA containing all MAG contigs.

| Property | Specification |
|----------|---------------|
| Extension | `.fa` or `.fasta` |
| Header Format | `<MAG_ID>.fa_<contigID>` |
| Example | `>Bacteroides_001.fa_k141_1234` |

**Create combined FASTA:**

```bash
alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
    --output-fasta combined.fasta --output-mapping mapping.tsv
```

### Prodigal Genes FASTA

Gene predictions in nucleotide format.

| Property | Specification |
|----------|---------------|
| Extension | `.fna` |
| Format | Prodigal nucleotide output (`-d` flag) |
| Header Example | `>Bacteroides_001.fa_k141_1234_1 # 100 # 450 # 1 # ID=1_1;...` |

**Generate predictions:**

```bash
# For combined FASTA (recommended)
prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
```

### Metadata File

Tab-separated file with sample information.

**Required columns:**

| Column | Description |
|--------|-------------|
| `sample_id` | Unique sample identifier |
| `bam_path` | Absolute path to BAM file |
| `subjectID` | Biological replicate/subject ID (for pairing samples) |
| `group` | Experimental group (e.g., `treatment`, `control`) |
| `replicate` | Replicate identifier within group (for CMH stratification) |
| `time` | Timepoint (longitudinal only; e.g., `pre`, `post`) |

**Example:**

```text
sample_id  bam_path           subjectID  group      replicate  time
S1         /data/sample1.bam  mouse1     control    A          pre
S2         /data/sample2.bam  mouse1     control    A          post
S3         /data/sample3.bam  mouse2     treatment  B          pre
S4         /data/sample4.bam  mouse2     treatment  B          post
```

**Add BAM paths automatically:**

```bash
alleleflux-add-bam-path --metadata meta.tsv --output meta_bam.tsv --bam-dir /data/bams/
```

### MAG Mapping File

Maps contigs to MAG IDs.

| Column | Description |
|--------|-------------|
| `contig_name` | Full contig name from reference FASTA |
| `mag_id` | MAG identifier |

**Example:**

```text
contig_name                    mag_id
Bacteroides_001.fa_k141_1234   Bacteroides_001
Lachnospira_002.fa_k141_9012   Lachnospira_002
```

**Auto-generated by** `alleleflux-create-mag-mapping`

### GTDB Taxonomy File

GTDB-Tk classification (optional, for annotation).

| Property | Specification |
|----------|---------------|
| File | `gtdbtk.bac120.summary.tsv` |
| Required Columns | `user_genome` (MAG ID), `classification` (GTDB taxonomy) |

**Generate with GTDB-Tk:**

```bash
gtdbtk classify_wf --genome_dir mags/ --out_dir gtdbtk/ --cpus 16
```

## Profile Files (Generated by Workflow)

Profile files are created by `alleleflux-profile` and serve as input for analysis steps.

**Path:** `profiles/{sample}/{sample}_{mag}_profiled.tsv.gz`

**Columns:**

| Column | Type | Description |
|--------|------|-------------|
| `contig` | str | Contig identifier |
| `position` | int | 0-based genomic position |
| `ref_base` | str | Reference base (A/C/G/T) |
| `total_coverage` | int | Total read depth |
| `A`, `C`, `G`, `T` | int | Base counts |
| `gene_id` | str | Overlapping gene (empty if intergenic) |

## Pre-Run Checklist

Verify before running:

- ☑ BAM files sorted and indexed (`.bai` exists)
- ☑ Reference FASTA headers: `<MAG_ID>.fa_<contigID>`
- ☑ Prodigal genes match reference contigs
- ☑ Metadata has all required columns
- ☑ BAM paths are absolute paths
- ☑ Longitudinal: each `subjectID` has all timepoints
- ☑ At least `min_sample_num` samples per group
- ☑ GTDB file contains all MAG IDs (if used)
- ☑ MAG mapping covers all contigs

## Common Issues

**Missing BAM indices**

```bash
# Index all BAM files
for bam in *.bam; do samtools index "$bam"; done
```

**Incorrect header format**

```bash
# Check FASTA headers
grep ">" combined.fasta | head
# Should match: >MAG_ID.fa_contigID
```

**Metadata path issues**

Use absolute paths for `bam_path`:

```bash
# Convert relative bam_path values to absolute (assumes bam_path is column 2, as in the metadata example; header row is skipped)
awk -v d="$(pwd)" 'BEGIN{FS=OFS="\t"} NR>1 && $2 !~ /^\// {$2 = d "/" $2} 1' metadata.tsv > metadata_abs.tsv
```

See also: [Input Preparation](../usage/input_preparation.md), [Configuration Reference](configuration.md)
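
**Checking metadata before a run**

Three of the pre-run checklist items (required columns present, absolute BAM paths, every `subjectID` covering all timepoints) can be checked mechanically. The following is a minimal sketch, not an AlleleFlux command: `check_metadata` is a hypothetical helper name, and it assumes a tab-separated file with a header row that uses the column names listed in this reference. The `time` check runs only when that column is present.

```bash
# Hypothetical helper (not part of AlleleFlux): prints one line per problem
# found in a metadata TSV and returns non-zero if any problem was found.
check_metadata() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NR == 1 {
      # Map column names to positions using the header row.
      for (i = 1; i <= NF; i++) col[$i] = i
      n = split("sample_id bam_path subjectID group replicate", req, " ")
      for (i = 1; i <= n; i++)
        if (!(req[i] in col)) { print "missing column: " req[i]; bad = 1 }
      if (bad) exit
      has_time = ("time" in col)
      next
    }
    {
      # BAM paths must be absolute.
      if ($(col["bam_path"]) !~ /^\//) {
        print "relative bam_path: " $(col["sample_id"]); bad = 1
      }
      # Record which timepoints each subject covers (longitudinal designs).
      if (has_time) {
        times[$(col["time"])] = 1
        subjects[$(col["subjectID"])] = 1
        seen[$(col["subjectID"]), $(col["time"])] = 1
      }
    }
    END {
      for (s in subjects)
        for (t in times)
          if (!((s, t) in seen)) {
            print "subject " s " lacks timepoint " t; bad = 1
          }
      exit bad
    }
  ' "$1"
}
```

For example, `check_metadata metadata.tsv && echo "metadata OK"` prints the confirmation only when no problems are reported. The sketch does not verify that the BAM files or their indices exist on disk.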