# Input Files Reference

AlleleFlux requires specific input file formats. This reference details all requirements.

## Core Input Files

### BAM Files

Sorted, indexed BAM files from metagenomic alignments to reference MAGs.

| Requirement | Description |
|-------------|-------------|
| Format | Coordinate-sorted `.bam` with `.bam.bai` index |
| Content | Reads aligned to MAG contigs |
| Quality | Reads with higher MAPQ give more reliable allele counts |
| Sample ID | From metadata `bam_path` or filename |
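
The index requirement can be checked before a run. The sketch below is a minimal example, not part of AlleleFlux: `bams_missing_index` is a hypothetical helper that assumes a tab-separated metadata file whose header row contains a `bam_path` column, and that each index sits next to its BAM as `<name>.bam.bai`.

```bash
# Hypothetical helper (not an AlleleFlux command): print every BAM listed in
# the metadata file that has no <name>.bam.bai index next to it.
bams_missing_index() {
    meta=$1
    # Locate the bam_path column by name, then emit one path per line.
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "bam_path") col = i; next }
        col && $col != "" { print $col }
    ' "$meta" |
    while IFS= read -r bam; do
        if [ ! -f "$bam.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}
```

Called as `bams_missing_index meta_bam.tsv`, it prints nothing when every listed BAM has an index.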

### Reference FASTA

Combined FASTA containing all MAG contigs.

| Property | Specification |
|----------|---------------|
| Extension | `.fa` or `.fasta` |
| Header Format | `<MAG_ID>.fa_<contig_ID>` |
| Example | `>Bacteroides_001.fa_k141_1234` |

**Create combined FASTA:**

```bash
alleleflux-create-mag-mapping --dir mag_fastas/ --extension fa \
                              --output-fasta combined.fasta --output-mapping mapping.tsv
```
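
To confirm that a combined FASTA follows the header convention, something like the following may help. It is a sketch under stated assumptions: `bad_fasta_headers` is a hypothetical helper, and it treats the first whitespace-delimited token of each header as the contig name.

```bash
# Hypothetical helper: print FASTA headers whose first token does not look
# like >MAG_ID.fa_contigID. No output means every header matched.
bad_fasta_headers() {
    awk '/^>/ && $1 !~ /^>.+\.fa_.+$/ { print }' "$1"
}
```

For example, `bad_fasta_headers combined.fasta` lists any header that lacks the `.fa_` separator between MAG ID and contig ID.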

### Prodigal Genes FASTA

Gene predictions in nucleotide format.

| Property | Specification |
|----------|---------------|
| Extension | `.fna` |
| Format | Prodigal nucleotide output (`-d` flag) |
| Header Example | `>Bacteroides_001.fa_k141_1234_1 # 100 # 450 # 1 # ID=1_1;...` |

**Generate predictions:**

```bash
# For combined FASTA (recommended)
prodigal -i combined.fasta -d genes.fna -a genes.faa -p meta
```
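
Prodigal names each gene `<contig>_<n>`, so gene IDs can be traced back to reference contigs by stripping the trailing number. The sketch below uses that convention to flag genes whose contig is absent from the reference; `genes_without_contig` is a hypothetical helper, not an AlleleFlux command.

```bash
# Hypothetical helper: print gene IDs from a Prodigal .fna file whose parent
# contig does not appear in the reference FASTA.
# Usage: genes_without_contig genes.fna combined.fasta
genes_without_contig() {
    genes=$1
    ref=$2
    awk '
        # First file (reference FASTA): remember every contig name.
        NR == FNR {
            if (/^>/) contigs[substr($1, 2)] = 1
            next
        }
        # Second file (genes): strip the trailing _<n> to recover the contig.
        /^>/ {
            gene = substr($1, 2)
            contig = gene
            sub(/_[0-9]+$/, "", contig)
            if (!(contig in contigs)) print gene
        }
    ' "$ref" "$genes"
}
```

An empty result suggests the gene predictions and the reference FASTA were built from the same contigs.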

### Metadata File

Tab-separated file with sample information.

**Required columns:**

| Column | Description |
|--------|-------------|
| `sample_id` | Unique sample identifier |
| `bam_path` | Absolute path to BAM file |
| `subjectID` | Biological replicate/subject ID (for pairing samples) |
| `group` | Experimental group (e.g., `treatment`, `control`) |
| `replicate` | Replicate identifier within group (for CMH stratification) |
| `time` | Timepoint (required for longitudinal designs only; e.g., `pre`, `post`) |

**Example:**

```text
sample_id  bam_path           subjectID  group      replicate  time
S1         /data/sample1.bam  mouse1     control    A          pre
S2         /data/sample2.bam  mouse1     control    A          post
S3         /data/sample3.bam  mouse2     treatment  B          pre
S4         /data/sample4.bam  mouse2     treatment  B          post
```

**Add BAM paths automatically:**

```bash
alleleflux-add-bam-path --metadata meta.tsv --output meta_bam.tsv --bam-dir /data/bams/
```
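
A quick header check can catch missing columns before the workflow starts. This is a minimal sketch: `missing_metadata_columns` is a hypothetical helper that assumes the first line of the file is a tab-separated header. With no extra arguments it checks the five columns needed for every design; pass column names explicitly (for example `time`) to check others.

```bash
# Hypothetical helper: print required column names that are absent from the
# header row of a tab-separated metadata file.
# Usage: missing_metadata_columns meta.tsv [column ...]
missing_metadata_columns() {
    meta=$1
    shift
    # Default column set when none is given on the command line.
    [ $# -gt 0 ] || set -- sample_id bam_path subjectID group replicate
    header=$(head -n 1 "$meta")
    for col in "$@"; do
        # Split the header into one name per line and look for an exact match.
        printf '%s\n' "$header" | tr '\t' '\n' | grep -Fxq -- "$col" ||
            printf '%s\n' "$col"
    done
}
```

For a longitudinal design, `missing_metadata_columns meta_bam.tsv sample_id bam_path subjectID group replicate time` also verifies the `time` column.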

### MAG Mapping File

Maps contigs to MAG IDs.

| Column | Description |
|--------|-------------|
| `contig_name` | Full contig name from reference FASTA |
| `mag_id` | MAG identifier |

**Example:**

```text
contig_name                   mag_id
Bacteroides_001.fa_k141_1234  Bacteroides_001
Lachnospira_002.fa_k141_9012  Lachnospira_002
```

This file is generated automatically by `alleleflux-create-mag-mapping` (the `--output-mapping` file in the command shown under Reference FASTA).
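
If the mapping file was edited by hand or built separately, its coverage of the reference can be checked with a sketch like this one. `unmapped_contigs` is a hypothetical helper; it assumes a tab-separated mapping file with a `contig_name` header and takes the first whitespace-delimited token of each FASTA header as the contig name.

```bash
# Hypothetical helper: print contigs present in the reference FASTA but
# missing from the MAG mapping file.
# Usage: unmapped_contigs combined.fasta mapping.tsv
unmapped_contigs() {
    ref=$1
    mapping=$2
    awk -F '\t' '
        # First file (mapping): find the contig_name column, then record names.
        NR == FNR {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) if ($i == "contig_name") c = i
            } else {
                mapped[$c] = 1
            }
            next
        }
        # Second file (FASTA): compare each header name against the mapping.
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*$/, "", name)
            if (!(name in mapped)) print name
        }
    ' "$mapping" "$ref"
}
```

No output means every contig in the reference has a MAG assignment.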

### GTDB Taxonomy File

GTDB-Tk classification (optional, for annotation).

| Property | Specification |
|----------|---------------|
| File | `gtdbtk.bac120.summary.tsv` |
| Required Columns | `user_genome` (MAG ID), `classification` (GTDB taxonomy) |

**Generate with GTDB-Tk:**

```bash
gtdbtk classify_wf --genome_dir mags/ --out_dir gtdbtk/ --cpus 16
```
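
When taxonomy is used, each MAG ID should appear in the `user_genome` column. The following sketch compares the two files; `mags_missing_taxonomy` is a hypothetical helper, and it assumes both files are tab-separated with the column names shown in this reference (`mag_id` in the mapping file, `user_genome` in the GTDB-Tk summary).

```bash
# Hypothetical helper: print MAG IDs from the mapping file that have no row
# in the GTDB-Tk summary.
# Usage: mags_missing_taxonomy mapping.tsv gtdbtk.bac120.summary.tsv
mags_missing_taxonomy() {
    mapping=$1
    gtdb=$2
    awk -F '\t' '
        # First file (GTDB-Tk summary): record every classified genome.
        NR == FNR {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) if ($i == "user_genome") g = i
            } else {
                classified[$g] = 1
            }
            next
        }
        # Second file (mapping): locate mag_id, then report each gap once.
        FNR == 1 { for (i = 1; i <= NF; i++) if ($i == "mag_id") m = i; next }
        !($m in classified) && !seen[$m]++ { print $m }
    ' "$gtdb" "$mapping"
}
```

Each missing MAG is reported once, regardless of how many contigs it has in the mapping file.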

## Profile Files (Generated by Workflow)

Profile files are created by `alleleflux-profile` and serve as input for analysis steps.

**Path:** `profiles/{sample}/{sample}_{mag}_profiled.tsv.gz`

**Columns:**

| Column | Type | Description |
|--------|------|-------------|
| `contig` | str | Contig identifier |
| `position` | int | 0-based genomic position |
| `ref_base` | str | Reference base (A/C/G/T) |
| `total_coverage` | int | Total read depth |
| `A`, `C`, `G`, `T` | int | Base counts |
| `gene_id` | str | Overlapping gene (empty if intergenic) |
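
As an illustration of how these columns fit together, the sketch below reports the most frequent base and its frequency at each covered position. `profile_major_allele` is a hypothetical helper, not part of AlleleFlux. It assumes the profile is gzipped, tab-separated, and has a header row with the column names listed above, and it divides by `total_coverage`, so the frequency is approximate if that column counts anything other than A, C, G, and T.

```bash
# Hypothetical helper: for each position with coverage, print
# contig, position, major base, and major-base frequency.
# Usage: profile_major_allele profiles/{sample}/{sample}_{mag}_profiled.tsv.gz
profile_major_allele() {
    gzip -dc "$1" | awk -F '\t' '
        # Map column names to indices so column order does not matter.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            depth = $(col["total_coverage"])
            if (depth == 0) next           # skip uncovered positions
            n = split("A C G T", bases, " ")
            best = "A"
            for (b = 2; b <= n; b++)
                if ($(col[bases[b]]) > $(col[best])) best = bases[b]
            printf "%s\t%s\t%s\t%.3f\n", $(col["contig"]), $(col["position"]), best, $(col[best]) / depth
        }
    '
}
```

Ties are resolved in A, C, G, T order, which is an arbitrary choice made for this example.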

## Pre-Run Checklist

Verify before running:

- ☑ BAM files sorted and indexed (`.bai` exists)
- ☑ Reference FASTA headers: `<MAG_ID>.fa_<contig_ID>`
- ☑ Prodigal genes match reference contigs
- ☑ Metadata has all required columns
- ☑ BAM paths are absolute paths
- ☑ Longitudinal: each `subjectID` has all timepoints
- ☑ At least `min_sample_num` samples per group
- ☑ GTDB file contains all MAG IDs (if used)
- ☑ MAG mapping covers all contigs
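
The longitudinal item in this checklist can be automated. The sketch below is one way to do it, under the assumption that the full set of timepoints is the union of all `time` values in the metadata; `subjects_missing_timepoints` is a hypothetical helper name.

```bash
# Hypothetical helper: print "subjectID<TAB>time" for every subject that
# lacks a sample at one of the timepoints observed in the metadata.
# Usage: subjects_missing_timepoints meta.tsv
subjects_missing_timepoints() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            s = $(col["subjectID"])
            t = $(col["time"])
            subjects[s] = 1
            times[t] = 1
            seen[s, t] = 1
        }
        END {
            for (s in subjects)
                for (t in times)
                    if (!((s, t) in seen)) print s "\t" t
        }
    ' "$1" | sort
}
```

An empty result means every subject has a sample at each observed timepoint.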

## Common Issues

**Missing BAM indices**

```bash
# Index all BAM files
for bam in *.bam; do samtools index "$bam"; done
```

**Incorrect header format**

```bash
# Check FASTA headers
grep "^>" combined.fasta | head
# Should match: >MAG_ID.fa_contigID
```

**Metadata path issues**

Use absolute paths for `bam_path`:

```bash
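# Prefix relative bam_path values with the current directory; other columns stay untouched
awk -v d="$(pwd)" 'BEGIN { FS = OFS = "\t" }
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "bam_path") c = i; print; next }
  c && $c !~ /^\// { $c = d "/" $c } { print }' metadata.tsv > metadata_abs.tsv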
```

See also: [Input Preparation](../usage/input_preparation.md), [Configuration Reference](configuration.md)