# FAQ & Troubleshooting

Frequently asked questions and solutions to common issues when using AlleleFlux.

---

## General Questions

### What is AlleleFlux?

AlleleFlux is an open-source bioinformatics toolkit for analyzing allele frequency changes in metagenomic data. It identifies genomic targets of natural selection in microbial communities by detecting parallel evolution through:

- **Parallelism and divergence scores** across MAGs, genes, and taxa
- **Multiple statistical tests** (t-tests, Mann-Whitney, LMM, CMH)
- **dN/dS ratios** using the Nei-Gojobori (1986) method with path averaging
- **Outlier gene detection** to pinpoint adaptive candidates

The workflow is implemented as a two-step Snakemake pipeline: Step 1 profiles samples and performs QC, while Step 2 runs statistical analyses and scoring.

### What types of data does AlleleFlux work with?

AlleleFlux works with **shotgun metagenomic data** aligned to metagenome-assembled genomes (MAGs). Required inputs include:

- **BAM files** -- reads aligned to a combined MAG reference (sorted and indexed)
- **Reference FASTA** -- combined MAG contigs with `<MAG_ID>.fa_<contig_ID>` headers
- **Prodigal gene predictions** -- nucleotide FASTA (`.fna`) for dN/dS and gene-level analysis
- **Sample metadata** -- TSV with sample IDs, BAM paths, group assignments, and (for longitudinal data) timepoints

AlleleFlux supports both **longitudinal** (multiple timepoints per subject) and **single-timepoint** (cross-sectional) study designs.

:::{seealso}
[Input Files Reference](inputs.md) for detailed format specifications.
:::

### What is the difference between "single" and "longitudinal" data types?

The `data_type` setting in your configuration controls which analyses are available:

| Feature | `single` | `longitudinal` |
|---------|----------|-----------------|
| Timepoints | One per subject | Multiple per subject |
| Paired tests | Not available | Paired t-test, Wilcoxon signed-rank |
| CMH test | Not available | Available (stratified by replicate) |
| LMM | Not available | Available (random effects for subjects) |
| dN/dS | Not applicable | Compares ancestral vs. derived timepoints |
| Focus timepoint | Not applicable | Required -- determines evolutionary direction |

Set `data_type: single` for cross-sectional studies where each subject is sampled once. Set `data_type: longitudinal` for time-series studies where the same subjects are sampled at multiple timepoints.

### How does AlleleFlux differ from other tools like SNPGenie, PAML, and HyPhy?

| Tool | Scope | Approach | Best For |
|------|-------|----------|----------|
| **AlleleFlux** | Metagenomic populations | Allele frequency changes across samples/timepoints | Parallel evolution in microbiomes |
| **SNPGenie** | Within-population | Polymorphism-based dN/dS from VCF | Single-population diversity |
| **PAML** | Phylogenetic lineages | Codon models on sequence alignments | Divergence between species |
| **HyPhy** | Phylogenetic branches | Hypothesis testing on phylogenies | Branch-specific selection |

AlleleFlux is specifically designed for **metagenomic** data with replicated experimental designs. It works with raw allele frequencies from BAM pileups (not consensus sequences or VCF files), enabling detection of subtle frequency shifts below fixation. Its statistical framework (parallelism/divergence scores, CMH stratification, LMM) is tailored for replicated microbiome experiments.

### What are MAGs and why does AlleleFlux use them?

**Metagenome-Assembled Genomes (MAGs)** are draft genomes reconstructed from metagenomic sequencing data by binning contigs that likely originate from the same organism. AlleleFlux uses MAGs because:

1. **Population-level resolution** -- MAGs represent microbial populations rather than individual strains, making allele frequency analysis meaningful
2. **Reference for alignment** -- BAM files are generated by aligning reads to MAG contigs, enabling position-level allele counting
3. **Biological unit of analysis** -- Scores are calculated per MAG, representing evolutionary changes in a specific microbial population
4. **Taxonomic context** -- MAGs can be classified with GTDB-Tk, allowing aggregation of scores at taxonomic levels

AlleleFlux supports MAGs from any assembler (MEGAHIT, metaSPAdes, etc.) as long as they follow the required header format.

### How many samples or replicates do I need?

A minimum of **4 samples per group** is required by default (configurable via `quality_control.min_sample_num`). More replicates improve statistical power:

| Replicates per Group | Power Level | Notes |
|----------------------|-------------|-------|
| 4-6 | Minimum | Basic detection of strong effects |
| 8-12 | Good | Recommended for most studies |
| >12 | Excellent | Detects subtle allele frequency shifts |

For **longitudinal studies**, each replicate/subject must have samples at all analyzed timepoints. For **CMH tests**, stratification by replicate requires at least 2-3 replicates per group.

:::{tip}
If you have fewer than 4 samples per group, lower `min_sample_num` in the config -- but be aware that statistical power will be reduced and false positives may increase.
:::

---

## Analysis Questions

### What statistical test should I use for my study design?

The choice depends on your experimental design. Use the decision flowchart in the [Statistical Tests Reference](statistical_tests.md#choosing-the-right-test):

| Study Design | Recommended Tests |
|--------------|-------------------|
| Cross-sectional, independent groups | Two-Sample Unpaired |
| Longitudinal, same subjects over time | Two-Sample Paired + LMM + CMH |
| Longitudinal, want to control for subject variation | LMM (best for repeated measures) |
| Longitudinal, want directional change detection | CMH (differential significance) |
| Within-group temporal changes | Single-Sample, LMM Across Time, CMH Across Time |

For comprehensive analysis, enable all available tests:

```yaml
analysis:
  use_significance_tests: true
  use_lmm: true
  use_cmh: true
```

Genes significant across multiple tests are the most robust candidates for selection.

:::{seealso}
[Statistical Tests Reference](statistical_tests.md) for detailed descriptions of each test and the decision flowchart.
:::

### What do parallelism and divergence scores mean?

- **Parallelism score**: The percentage of genomic positions showing statistically significant allele frequency changes within a group. High parallelism indicates **deterministic evolution** -- the same sites are changing consistently across replicates, suggesting natural selection rather than genetic drift.

- **Divergence score**: The percentage of genomic positions showing statistically significant differences in allele frequencies **between groups**. High divergence indicates **differential selection** between experimental conditions.

Both scores are expressed as percentages (0-100%) and are calculated at MAG, gene, and taxonomic levels.

**Example interpretation**: A gene with 15% parallelism score and 20% divergence score has 15% of its positions evolving consistently within a group and 20% of its positions evolving differently between groups.

### How is the CMH score different from other scores?

Most test scores use the simple formula:

```text
Score (%) = (Significant Sites / Total Sites) x 100
```

The CMH score instead uses **differential significance**:

```text
Score (%) = (Sites significant at focus timepoint BUT NOT at the other timepoint)
            / (Total sites present in BOTH timepoints) x 100
```

This captures **directional change** -- sites where allele frequencies became significantly different over time -- rather than just overall significance. A high CMH score means many sites gained significance specifically at the focus (later) timepoint, indicating active evolutionary change.

:::{seealso}
[Statistical Tests Reference -- CMH Score Calculation](statistical_tests.md#for-cmh-between-group) for the full mathematical formula.
:::

### What does the focus timepoint mean?

The **focus timepoint** represents the **derived** or **later** state in evolutionary comparisons:

- **For dN/dS**: Focus = derived timepoint; the other = ancestral. AlleleFlux calculates changes in the direction ancestral -> derived.
- **For CMH scores**: Measures sites that became significant at the focus timepoint but were not significant at the other timepoint.
- **Rule of thumb**: Always set `focus` to the **later** timepoint.

```yaml
timepoints_combinations:
  - timepoint: ["pre", "post"]
    focus: "post"       # "post" is the derived state
  - timepoint: ["day0", "day30"]
    focus: "day30"      # "day30" is the endpoint
```

If not specified, AlleleFlux defaults to the second timepoint in the list.

:::{seealso}
[Configuration Reference -- Focus Timepoint](configuration.md#analysis) for detailed configuration examples.
:::

### How is dN/dS calculated?

AlleleFlux uses the **Nei-Gojobori (1986) method** with **path averaging**:

1. **Potential sites**: For each codon, calculate expected synonymous (S) and non-synonymous (N) sites based on the genetic code
2. **Observed changes**: At statistically significant positions, compare ancestral and derived codons
3. **Path averaging**: When multiple positions change in a codon (k=2 or k=3), enumerate all mutational pathways and average the S/NS classification
4. **Ratios**: Calculate `pS = observed_S / potential_S`, `pN = observed_N / potential_N`, then `dN/dS = pN / pS`

**NaN values** occur when `pS = 0` (no synonymous substitutions observed), which is common for short genes or closely related timepoints.

:::{seealso}
[dN/dS Analysis Guide](../usage/dnds_analysis.md) for the full methodology, usage examples, and interpretation guide.
:::

### Can I run only part of the pipeline?

Yes. There are three approaches:

**1. Use `allele_analysis_only` to skip statistical tests:**

```yaml
analysis:
  allele_analysis_only: true   # Only runs profiling + allele frequency analysis
```

**2. Selectively enable/disable test modules:**

```yaml
analysis:
  use_significance_tests: true   # Two-sample and single-sample tests
  use_lmm: false                 # Disable LMM
  use_cmh: false                 # Disable CMH
```

**3. Run individual CLI tools standalone:**

Every analysis step is available as a separate command:

```bash
alleleflux-profile --help        # Profiling only
alleleflux-qc --help             # Quality control only
alleleflux-scores --help          # Score calculation only
alleleflux-dnds-from-timepoints --help  # dN/dS only
```

:::{seealso}
[CLI Reference](cli_reference.md) for the complete list of standalone tools.
:::

---

## Input / Output Questions

### What FASTA header format is required?

The combined reference FASTA must use the format `<MAG_ID>.fa_<contig_ID>`:

```text
>Bacteroides_001.fa_k141_12345
ATGCGATCGATCG...
>Bacteroides_001.fa_k141_67890
GCTAGCTAGCTAG...
>Lachnospira_002.fa_scaffold_1
ATCGATCGATCGA...
```

Use `alleleflux-create-mag-mapping` to combine individual MAG FASTAs and automatically generate correct headers and the mapping file:

```bash
alleleflux-create-mag-mapping \
    --dir mag_fastas/ --extension fa \
    --output-fasta combined.fasta --output-mapping mapping.tsv
```

:::{seealso}
[Input Files Reference](inputs.md) for all format specifications.
:::

### Can I use MAGs from different assemblers?

Yes. AlleleFlux is assembler-agnostic. MAGs from MEGAHIT, metaSPAdes, Flye, or any other assembler work as long as:

1. Individual MAG FASTA files are provided in a single directory
2. You run `alleleflux-create-mag-mapping` to create the combined reference with standardized headers
3. Reads are aligned to this combined reference (not the original per-assembler contigs)

The assembler-specific contig names are preserved after the `<MAG_ID>.fa_` prefix.

### What BAM quality is needed?

BAM files must be:

- **Sorted** by coordinate (`samtools sort`)
- **Indexed** with a `.bai` file (`samtools index`)
- Aligned to the **combined MAG reference FASTA** (the same file specified as `fasta_path` in config)

AlleleFlux applies quality filters during profiling:

| Parameter | Default | Effect |
|-----------|---------|--------|
| `min_base_quality` | 30 | Minimum Phred score per base |
| `min_mapping_quality` | 2 | Minimum MAPQ per read |
| `ignore_orphans` | true | Exclude unpaired reads |
| `ignore_overlaps` | true | Avoid double-counting overlapping pairs |

```bash
# Prepare BAM files
samtools sort input.bam -o input.sorted.bam
samtools index input.sorted.bam
```

:::{tip}
Keep `min_base_quality` at 30 for standard Illumina data. Lower to 20 for older sequencing platforms or if coverage is a concern.
:::

### Can I reuse profile files from a previous run?

Yes. Profile files (`profiles/{sample}/{sample}_{mag}_profiled.tsv.gz`) are the most computationally expensive output. Snakemake automatically detects existing profile files and skips re-profiling when you re-run the pipeline.

To re-analyze with different statistical parameters (e.g., different p-value thresholds or group comparisons) without re-profiling:

1. Keep the same `output.root_dir` in your config
2. Modify only the `analysis`, `statistics`, or `quality_control` sections
3. Re-run the pipeline -- Snakemake will reuse profiles and only recompute downstream steps

```bash
# Edit config with new parameters, then re-run
alleleflux run --config config_v2.yml --threads 16
```

### Where are the main results files?

Key output locations under `{root_dir}/{data_type}/`:

| Result | Path |
|--------|------|
| **MAG-level scores** | `scores/processed/combined/MAG/scores_{test}-{timepoints}-{groups}*.tsv` |
| **Gene-level scores** | `scores/processed/gene_scores_{timepoints}-{groups}/{test}_gene_scores.tsv.gz` |
| **Outlier genes** | `outlier_genes/{timepoints}-{groups}/{test}_outlier_genes.tsv.gz` |
| **Statistical test results** | `significance_tests/{test}_{timepoints}-{groups}/{mag}_*.tsv.gz` |
| **Eligibility tables** | `eligibility_table_{timepoints}-{groups}.tsv` |
| **QC metrics** | `QC/QC_{timepoints}-{groups}/{mag}_qc.tsv` |
| **dN/dS results** | `dnds/{timepoints}-{groups}/{mag}_gene_summary_ng86.tsv.gz` |

:::{seealso}
[Output Files Reference](outputs.md) for complete column descriptions and file format details.
:::

---

## Troubleshooting

### Installation Issues

#### rpy2 fails to install

**Symptom:**

```text
rpy2.rinterface_lib.openrlib.ffi.error: R_HOME not found
```

or compilation errors during `pip install rpy2`.

**Solution:** Install via conda, which handles R and rpy2 together:

```bash
conda install -c conda-forge rpy2 r-base r-tidyr
```

If using pip, ensure `R_HOME` is set:

```bash
export R_HOME=$(R RHOME)
pip install rpy2
```

#### pysam compilation error

**Symptom:**

```text
fatal error: htslib/sam.h: No such file or directory
```

**Solution:** Install pysam from conda-forge instead of building from source:

```bash
conda install -c conda-forge pysam
```

Alternatively, install `htslib` and `samtools` system-wide before running `pip install pysam`.

#### Snakemake version conflict

**Symptom:** Errors about missing Snakemake features or plugin architecture.

**Solution:** AlleleFlux requires Snakemake >= 8.0.0:

```bash
# Check current version
snakemake --version

# Upgrade
conda install -c conda-forge -c bioconda 'snakemake>=8.0.0'
```

Snakemake 8.x uses a plugin-based executor system. For cluster execution, also install:

```bash
conda install -c conda-forge snakemake-executor-plugin-cluster-generic
```

:::{seealso}
[Installation](../getting_started/installation.md) for complete setup instructions.
:::

---

### Runtime Errors

#### "Directory is locked"

**Symptom:**

```text
Error: Directory cannot be locked. ...
```

This occurs when a previous Snakemake run crashed or was interrupted.

**Solution:**

```bash
# Unlock the working directory
alleleflux run --config config.yml --unlock

# Then re-run normally
alleleflux run --config config.yml --threads 16
```

Or directly with Snakemake:

```bash
snakemake -s step1.smk --configfile config.yml --unlock
```

#### "No MAGs eligible" for a test

**Symptom:** The eligibility table shows all MAGs as ineligible for one or more test types, and downstream rules produce no output.

**Causes and solutions:**

1. **`breadth_threshold` too high** -- Lower from default `0.1` to `0.05`:

   ```yaml
   quality_control:
     breadth_threshold: 0.05
   ```

2. **`min_sample_num` too high** -- Lower the minimum required samples per group:

   ```yaml
   quality_control:
     min_sample_num: 3
   ```

3. **Incorrect BAM paths** -- Verify all paths in your metadata file exist:

   ```bash
   awk -F'\t' 'NR>1 {print $2}' metadata.tsv | xargs -I{} test -f {} || echo "MISSING: {}"
   ```

4. **Low sequencing depth** -- Samples may not have sufficient coverage of MAG contigs

Inspect the eligibility table directly:

```bash
column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv
```

#### Memory errors (out of memory / killed)

**Symptom:** Jobs are killed by the OS or SLURM with `MemoryError` or `SIGKILL`.

**Solutions:**

1. **Increase per-job memory in config:**

   ```yaml
   resources:
     mem_per_job: "16G"    # Increase from default 8G
   ```

2. **Reduce parallelism** to lower total memory usage:

   ```bash
   alleleflux run --config config.yml --threads 4 --jobs 2
   ```

3. **Memory estimates by dataset size:**

   | MAGs | Samples | Recommended Memory |
   |------|---------|-------------------|
   | < 100 | < 20 | 8 GB |
   | 100-500 | 20-50 | 16-32 GB |
   | 500+ | 50+ | 32-64 GB |

#### R / rpy2 errors during CMH tests

**Symptom:**

```text
RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr'
```

or other R-related errors during CMH rule execution.

**Solution:** Ensure R and the `tidyr` package are installed and accessible:

```bash
# Check R is available
R --version

# Check tidyr is installed
R -e "library(tidyr)"

# Install if missing
R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"

# Or via conda (recommended)
conda install -c conda-forge r-tidyr
```

If R is installed but not found by rpy2, set `R_HOME`:

```bash
export R_HOME=$(R RHOME)
```

#### "Missing input files" in Snakemake

**Symptom:**

```text
MissingInputException in rule ...:
Missing input files for rule ...:
```

**Causes and solutions:**

1. **Incorrect paths in config** -- Verify all paths are absolute or correct relative to the working directory:

   ```yaml
   input:
     fasta_path: /absolute/path/to/combined.fasta    # Use absolute paths
     metadata_path: /absolute/path/to/metadata.tsv
   ```

2. **Missing upstream outputs** -- A previous step may have failed silently. Check logs:

   ```bash
   ls -la output/longitudinal/profiles/   # Are profile files present?
   ```

3. **File permission issues** -- Ensure read access to all input files and write access to the output directory.

#### Empty output files

**Symptom:** Output files exist but contain only headers or are 0 bytes.

**Causes and solutions:**

1. **QC thresholds too strict** -- Too many samples/positions filtered out:

   ```yaml
   quality_control:
     breadth_threshold: 0.05      # Lower from 0.1
     min_sample_num: 3            # Lower from 4
   ```

2. **Preprocessing filters too aggressive** -- Check `max_zero_count` and preprocessing settings:

   ```yaml
   statistics:
     max_zero_count: 6            # Increase to allow more zeros
     preprocess_between_groups: false   # Temporarily disable to test
   ```

3. **Check log files** for warnings:

   ```bash
   # Snakemake logs
   ls output/longitudinal/.snakemake/log/

   # Or check stderr of specific rules
   grep -r "WARNING\|ERROR" output/longitudinal/.snakemake/log/
   ```

---

### Performance Tips

#### Reuse profiles for re-analysis

Profile generation (Step 1) is the most time-consuming step. To re-analyze with different parameters without re-profiling:

1. Keep the same `output.root_dir`
2. Change only `analysis`, `statistics`, or `quality_control` settings
3. Re-run -- Snakemake automatically detects existing profiles and skips them

This is particularly useful for:
- Testing different p-value thresholds
- Adding or removing group comparisons
- Adjusting QC parameters

#### Optimal thread and memory settings

| Dataset Size | Threads | Memory per Job | Concurrent Jobs |
|-------------|---------|----------------|-----------------|
| Small (<20 samples, <100 MAGs) | 4-8 | 4G | 4 |
| Medium (20-50 samples, 100-500 MAGs) | 8-16 | 8G | 2-4 |
| Large (50+ samples, 500+ MAGs) | 16-32 | 16-32G | 1-2 |

```yaml
resources:
  threads_per_job: 16
  mem_per_job: "8G"
  time: "24:00:00"
```

#### Using SLURM profiles for HPC

For cluster environments, use a Snakemake SLURM profile:

```bash
# Create a SLURM profile
mkdir -p ~/.config/snakemake/slurm

# Run with the profile
alleleflux run --config config.yml -- --profile slurm
```

Or use the cluster-generic executor plugin:

```bash
alleleflux run --config config.yml -- \
    --executor cluster-generic \
    --cluster-generic-submit-cmd "sbatch --mem={resources.mem_per_job} --cpus-per-task={threads} --time={resources.time}"
```

---

### Common Warnings

#### "NaN p-values" in statistical test output

**Cause:** Insufficient data at some genomic positions to compute a valid test statistic. This is normal and expected for sparse metagenomic data.

**When to worry:**
- If a **small fraction** of positions have NaN p-values, this is typical -- sparse coverage or zero-variance positions
- If **most** positions have NaN p-values, check that samples have adequate coverage and that QC is not too aggressive

**Action:** NaN p-values are automatically excluded from score calculations. No intervention is needed unless the proportion is unexpectedly high.

#### "0 eligible MAGs" for a test type

**Cause:** No MAGs pass the eligibility criteria for a particular test. Common reasons:

- Too few samples passing QC per group (below `min_sample_num`)
- Paired tests require matching samples across timepoints -- missing timepoints disqualify MAGs
- Low genome coverage in most samples

**Action:**

```yaml
# Relax QC thresholds
quality_control:
  min_sample_num: 3           # Lower minimum
  breadth_threshold: 0.05     # Lower coverage requirement
```

Check which MAGs are close to eligibility:

```bash
column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv | head -20
```

#### "Reference base mismatch" in dN/dS or profile files

**Cause:** The reference base in profile data does not match the FASTA file. This typically means different reference FASTA files were used for alignment (BAM generation) and for AlleleFlux analysis.

**Action:**
- Ensure all BAM files were aligned to the **same** combined reference FASTA specified in `input.fasta_path`
- If BAMs were aligned to different references, re-align to the combined reference
- Verify with: `samtools view -H sample.bam | grep "^@SQ" | head` and compare contig names to your reference FASTA

---

## See Also

- [Installation](../getting_started/installation.md) -- Setup and dependency management
- [Configuration Reference](configuration.md) -- Full config options and examples
- [CLI Reference](cli_reference.md) -- All available command-line tools
- [Statistical Tests Reference](statistical_tests.md) -- Test descriptions and decision flowchart
- [Output Files Reference](outputs.md) -- Output file formats and locations
- [Interpreting Results](../usage/interpreting_results.md) -- How to interpret scores and findings
- [dN/dS Analysis Guide](../usage/dnds_analysis.md) -- Evolutionary rate analysis