FAQ & Troubleshooting¶

Frequently asked questions and solutions to common issues when using AlleleFlux.

General Questions¶

What is AlleleFlux?¶

AlleleFlux is an open-source bioinformatics toolkit for analyzing allele frequency changes in metagenomic data. It identifies genomic targets of natural selection in microbial communities by detecting parallel evolution through:

Parallelism and divergence scores across MAGs, genes, and taxa
Multiple statistical tests (t-tests, Mann-Whitney, LMM, CMH)
dN/dS ratios using the Nei-Gojobori (1986) method with path averaging
Outlier gene detection to pinpoint adaptive candidates

The workflow is implemented as a two-step Snakemake pipeline: Step 1 profiles samples and performs QC, while Step 2 runs statistical analyses and scoring.

What types of data does AlleleFlux work with?¶

AlleleFlux works with shotgun metagenomic data aligned to metagenome-assembled genomes (MAGs). Required inputs include:

BAM files – reads aligned to a combined MAG reference (sorted and indexed)
Reference FASTA – combined MAG contigs with <MAG_ID>.fa_<contig_ID> headers
Prodigal gene predictions – nucleotide FASTA (.fna) for dN/dS and gene-level analysis
Sample metadata – TSV with sample IDs, BAM paths, group assignments, and (for longitudinal data) timepoints

AlleleFlux supports both longitudinal (multiple timepoints per subject) and single-timepoint (cross-sectional) study designs.

See also

Input Files Reference for detailed format specifications.

What is the difference between “single” and “longitudinal” data types?¶

The data_type setting in your configuration controls which analyses are available:

Feature	`single`	`longitudinal`
Timepoints	One per subject	Multiple per subject
Paired tests	Not available	Paired t-test, Wilcoxon signed-rank
CMH test	Not available	Available (stratified by replicate)
LMM	Not available	Available (random effects for subjects)
dN/dS	Not applicable	Compares ancestral vs. derived timepoints
Focus timepoint	Not applicable	Required – determines evolutionary direction

Set data_type: single for cross-sectional studies where each subject is sampled once. Set data_type: longitudinal for time-series studies where the same subjects are sampled at multiple timepoints.

How does AlleleFlux differ from other tools like SNPGenie, PAML, and HyPhy?¶

Tool	Scope	Approach	Best For
AlleleFlux	Metagenomic populations	Allele frequency changes across samples/timepoints	Parallel evolution in microbiomes
SNPGenie	Within-population	Polymorphism-based dN/dS from VCF	Single-population diversity
PAML	Phylogenetic lineages	Codon models on sequence alignments	Divergence between species
HyPhy	Phylogenetic branches	Hypothesis testing on phylogenies	Branch-specific selection

AlleleFlux is specifically designed for metagenomic data with replicated experimental designs. It works with raw allele frequencies from BAM pileups (not consensus sequences or VCF files), enabling detection of subtle frequency shifts below fixation. Its statistical framework (parallelism/divergence scores, CMH stratification, LMM) is tailored for replicated microbiome experiments.

What are MAGs and why does AlleleFlux use them?¶

Metagenome-Assembled Genomes (MAGs) are draft genomes reconstructed from metagenomic sequencing data by binning contigs that likely originate from the same organism. AlleleFlux uses MAGs because:

Population-level resolution – MAGs represent microbial populations rather than individual strains, making allele frequency analysis meaningful
Reference for alignment – BAM files are generated by aligning reads to MAG contigs, enabling position-level allele counting
Biological unit of analysis – Scores are calculated per MAG, representing evolutionary changes in a specific microbial population
Taxonomic context – MAGs can be classified with GTDB-Tk, allowing aggregation of scores at taxonomic levels

AlleleFlux supports MAGs from any assembler (MEGAHIT, metaSPAdes, etc.) as long as they follow the required header format.

How many samples or replicates do I need?¶

A minimum of 4 samples per group is required by default (configurable via quality_control.min_sample_num). More replicates improve statistical power:

Replicates per Group	Power Level	Notes
4-6	Minimum	Basic detection of strong effects
8-12	Good	Recommended for most studies
>12	Excellent	Detects subtle allele frequency shifts

For longitudinal studies, each replicate/subject must have samples at all analyzed timepoints. For CMH tests, stratification by replicate requires at least 2-3 replicates per group.

Tip

If you have fewer than 4 samples per group, lower min_sample_num in the config – but be aware that statistical power will be reduced and false positives may increase.

Analysis Questions¶

What statistical test should I use for my study design?¶

The choice depends on your experimental design. Use the decision flowchart in the Statistical Tests Reference:

Study Design	Recommended Tests
Cross-sectional, independent groups	Two-Sample Unpaired
Longitudinal, same subjects over time	Two-Sample Paired + LMM + CMH
Longitudinal, want to control for subject variation	LMM (best for repeated measures)
Longitudinal, want directional change detection	CMH (differential significance)
Within-group temporal changes	Single-Sample, LMM Across Time, CMH Across Time

For comprehensive analysis, enable all available tests:

analysis:
  use_significance_tests: true
  use_lmm: true
  use_cmh: true

Genes significant across multiple tests are the most robust candidates for selection.

See also

Statistical Tests Reference for detailed descriptions of each test and the decision flowchart.

What do parallelism and divergence scores mean?¶

Parallelism score: The percentage of genomic positions showing statistically significant allele frequency changes within a group. High parallelism indicates deterministic evolution – the same sites are changing consistently across replicates, suggesting natural selection rather than genetic drift.
Divergence score: The percentage of genomic positions showing statistically significant differences in allele frequencies between groups. High divergence indicates differential selection between experimental conditions.

Both scores are expressed as percentages (0-100%) and are calculated at MAG, gene, and taxonomic levels.

Example interpretation: A gene with 15% parallelism score and 20% divergence score has 15% of its positions evolving consistently within a group and 20% of its positions evolving differently between groups.

How is the CMH score different from other scores?¶

Most test scores use the simple formula:

Score (%) = (Significant Sites / Total Sites) x 100

The CMH score instead uses differential significance:

Score (%) = (Sites significant at focus timepoint BUT NOT at the other timepoint)
            / (Total sites present in BOTH timepoints) x 100

This captures directional change – sites where allele frequencies became significantly different over time – rather than just overall significance. A high CMH score means many sites gained significance specifically at the focus (later) timepoint, indicating active evolutionary change.

See also

Statistical Tests Reference – CMH Score Calculation for the full mathematical formula.

What does the focus timepoint mean?¶

The focus timepoint represents the derived or later state in evolutionary comparisons:

For dN/dS: Focus = derived timepoint; the other = ancestral. AlleleFlux calculates changes in the direction ancestral -> derived.
For CMH scores: Measures sites that became significant at the focus timepoint but were not significant at the other timepoint.
Rule of thumb: Always set focus to the later timepoint.

timepoints_combinations:
  - timepoint: ["pre", "post"]
    focus: "post"       # "post" is the derived state
  - timepoint: ["day0", "day30"]
    focus: "day30"      # "day30" is the endpoint

If not specified, AlleleFlux defaults to the second timepoint in the list.

See also

Configuration Reference – Focus Timepoint for detailed configuration examples.

How is dN/dS calculated?¶

AlleleFlux uses the Nei-Gojobori (1986) method with path averaging:

Potential sites: For each codon, calculate expected synonymous (S) and non-synonymous (N) sites based on the genetic code
Observed changes: At statistically significant positions, compare ancestral and derived codons
Path averaging: When multiple positions change in a codon (k=2 or k=3), enumerate all mutational pathways and average the S/NS classification
Ratios: Calculate pS = observed_S / potential_S, pN = observed_N / potential_N, then dN/dS = pN / pS

NaN values occur when pS = 0 (no synonymous substitutions observed), which is common for short genes or closely related timepoints.

See also

dN/dS Analysis Guide for the full methodology, usage examples, and interpretation guide.

Can I run only part of the pipeline?¶

Yes. There are three approaches:

1. Use allele_analysis_only to skip statistical tests:

analysis:
  allele_analysis_only: true   # Only runs profiling + allele frequency analysis

2. Selectively enable/disable test modules:

analysis:
  use_significance_tests: true   # Two-sample and single-sample tests
  use_lmm: false                 # Disable LMM
  use_cmh: false                 # Disable CMH

3. Run individual CLI tools standalone:

Every analysis step is available as a separate command:

alleleflux-profile --help        # Profiling only
alleleflux-qc --help             # Quality control only
alleleflux-scores --help          # Score calculation only
alleleflux-dnds-from-timepoints --help  # dN/dS only

See also

CLI Reference for the complete list of standalone tools.

Input / Output Questions¶

What FASTA header format is required?¶

The combined reference FASTA must use the format <MAG_ID>.fa_<contig_ID>:

>Bacteroides_001.fa_k141_12345
ATGCGATCGATCG...
>Bacteroides_001.fa_k141_67890
GCTAGCTAGCTAG...
>Lachnospira_002.fa_scaffold_1
ATCGATCGATCGA...

Use alleleflux-create-mag-mapping to combine individual MAG FASTAs and automatically generate correct headers and the mapping file:

alleleflux-create-mag-mapping \
    --dir mag_fastas/ --extension fa \
    --output-fasta combined.fasta --output-mapping mapping.tsv

See also

Input Files Reference for all format specifications.

Can I use MAGs from different assemblers?¶

Yes. AlleleFlux is assembler-agnostic. MAGs from MEGAHIT, metaSPAdes, Flye, or any other assembler work as long as:

Individual MAG FASTA files are provided in a single directory
You run alleleflux-create-mag-mapping to create the combined reference with standardized headers
Reads are aligned to this combined reference (not the original per-assembler contigs)

The assembler-specific contig names are preserved after the <MAG_ID>.fa_ prefix.

What BAM quality is needed?¶

BAM files must be:

Sorted by coordinate (samtools sort)
Indexed with a .bai file (samtools index)
Aligned to the combined MAG reference FASTA (the same file specified as fasta_path in config)

AlleleFlux applies quality filters during profiling:

Parameter	Default	Effect
`min_base_quality`	30	Minimum Phred score per base
`min_mapping_quality`	2	Minimum MAPQ per read
`ignore_orphans`	true	Exclude unpaired reads
`ignore_overlaps`	true	Avoid double-counting overlapping pairs

# Prepare BAM files
samtools sort input.bam -o input.sorted.bam
samtools index input.sorted.bam

Tip

Keep min_base_quality at 30 for standard Illumina data. Lower to 20 for older sequencing platforms or if coverage is a concern.

Can I reuse profile files from a previous run?¶

Yes. Profile files (profiles/{sample}/{sample}_{mag}_profiled.tsv.gz) are the most computationally expensive output. Snakemake automatically detects existing profile files and skips re-profiling when you re-run the pipeline.

To re-analyze with different statistical parameters (e.g., different p-value thresholds or group comparisons) without re-profiling:

Keep the same output.root_dir in your config
Modify only the analysis, statistics, or quality_control sections
Re-run the pipeline – Snakemake will reuse profiles and only recompute downstream steps

# Edit config with new parameters, then re-run
alleleflux run --config config_v2.yml --threads 16

Where are the main results files?¶

Key output locations under {root_dir}/{data_type}/:

Result	Path
MAG-level scores	`scores/processed/combined/MAG/scores_{test}-{timepoints}-{groups}*.tsv`
Gene-level scores	`scores/processed/gene_scores_{timepoints}-{groups}/{test}_gene_scores.tsv.gz`
Outlier genes	`outlier_genes/{timepoints}-{groups}/{test}_outlier_genes.tsv.gz`
Statistical test results	`significance_tests/{test}_{timepoints}-{groups}/{mag}_*.tsv.gz`
Eligibility tables	`eligibility_table_{timepoints}-{groups}.tsv`
QC metrics	`QC/QC_{timepoints}-{groups}/{mag}_qc.tsv`
dN/dS results	`dnds/{timepoints}-{groups}/{mag}_gene_summary_ng86.tsv.gz`

See also

Output Files Reference for complete column descriptions and file format details.

Troubleshooting¶

Installation Issues¶

rpy2 fails to install¶

Symptom:

rpy2.rinterface_lib.openrlib.ffi.error: R_HOME not found

or compilation errors during pip install rpy2.

Solution: Install via conda, which handles R and rpy2 together:

conda install -c conda-forge rpy2 r-base r-tidyr

If using pip, ensure R_HOME is set:

export R_HOME=$(R RHOME)
pip install rpy2

pysam compilation error¶

Symptom:

fatal error: htslib/sam.h: No such file or directory

Solution: Install pysam from conda-forge instead of building from source:

conda install -c conda-forge pysam

Alternatively, install htslib and samtools system-wide before running pip install pysam.

Snakemake version conflict¶

Symptom: Errors about missing Snakemake features or plugin architecture.

Solution: AlleleFlux requires Snakemake >= 8.0.0:

# Check current version
snakemake --version

# Upgrade
conda install -c conda-forge -c bioconda 'snakemake>=8.0.0'

Snakemake 8.x uses a plugin-based executor system. For cluster execution, also install:

conda install -c conda-forge snakemake-executor-plugin-cluster-generic

See also

Installation for complete setup instructions.

Runtime Errors¶

“Directory is locked”¶

Symptom:

Error: Directory cannot be locked. ...

This occurs when a previous Snakemake run crashed or was interrupted.

Solution:

# Unlock the working directory
alleleflux run --config config.yml --unlock

# Then re-run normally
alleleflux run --config config.yml --threads 16

Or directly with Snakemake:

snakemake -s step1.smk --configfile config.yml --unlock

“No MAGs eligible” for a test¶

Symptom: The eligibility table shows all MAGs as ineligible for one or more test types, and downstream rules produce no output.

Causes and solutions:

breadth_threshold too high – Lower from default 0.1 to 0.05:
```
quality_control:
  breadth_threshold: 0.05
```
min_sample_num too high – Lower the minimum required samples per group:
```
quality_control:
  min_sample_num: 3
```

Incorrect BAM paths – Verify all paths in your metadata file exist:

awk -F'\t' 'NR>1 {print $2}' metadata.tsv | xargs -I{} test -f {} || echo "MISSING: {}"

Low sequencing depth – Samples may not have sufficient coverage of MAG contigs

Inspect the eligibility table directly:

column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv

Memory errors (out of memory / killed)¶

Symptom: Jobs are killed by the OS or SLURM with MemoryError or SIGKILL.

Solutions:

Increase per-job memory in config:

resources:
  mem_per_job: "16G"    # Increase from default 8G

Reduce parallelism to lower total memory usage:

alleleflux run --config config.yml --threads 4 --jobs 2

Memory estimates by dataset size:

MAGs

Samples

Recommended Memory

< 100

< 20

8 GB

100-500

20-50

16-32 GB

500+

50+

32-64 GB

MAGs	Samples	Recommended Memory
< 100	< 20	8 GB
100-500	20-50	16-32 GB
500+	50+	32-64 GB

R / rpy2 errors during CMH tests¶

Symptom:

RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr'

or other R-related errors during CMH rule execution.

Solution: Ensure R and the tidyr package are installed and accessible:

# Check R is available
R --version

# Check tidyr is installed
R -e "library(tidyr)"

# Install if missing
R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"

# Or via conda (recommended)
conda install -c conda-forge r-tidyr

If R is installed but not found by rpy2, set R_HOME:

export R_HOME=$(R RHOME)

“Missing input files” in Snakemake¶

Symptom:

MissingInputException in rule ...:
Missing input files for rule ...:

Causes and solutions:

Incorrect paths in config – Verify all paths are absolute or correct relative to the working directory:

input:
  fasta_path: /absolute/path/to/combined.fasta    # Use absolute paths
  metadata_path: /absolute/path/to/metadata.tsv

Missing upstream outputs – A previous step may have failed silently. Check logs:
```
ls -la output/longitudinal/profiles/   # Are profile files present?
```
File permission issues – Ensure read access to all input files and write access to the output directory.

Empty output files¶

Symptom: Output files exist but contain only headers or are 0 bytes.

Causes and solutions:

QC thresholds too strict – Too many samples/positions filtered out:

quality_control:
  breadth_threshold: 0.05      # Lower from 0.1
  min_sample_num: 3            # Lower from 4

Preprocessing filters too aggressive – Check max_zero_count and preprocessing settings:

statistics:
  max_zero_count: 6            # Increase to allow more zeros
  preprocess_between_groups: false   # Temporarily disable to test

Check log files for warnings:

# Snakemake logs
ls output/longitudinal/.snakemake/log/

# Or check stderr of specific rules
grep -r "WARNING\|ERROR" output/longitudinal/.snakemake/log/

Performance Tips¶

Reuse profiles for re-analysis¶

Profile generation (Step 1) is the most time-consuming step. To re-analyze with different parameters without re-profiling:

Keep the same output.root_dir
Change only analysis, statistics, or quality_control settings
Re-run – Snakemake automatically detects existing profiles and skips them

This is particularly useful for:

Testing different p-value thresholds
Adding or removing group comparisons
Adjusting QC parameters

Optimal thread and memory settings¶

Dataset Size	Threads	Memory per Job	Concurrent Jobs
Small (<20 samples, <100 MAGs)	4-8	4G	4
Medium (20-50 samples, 100-500 MAGs)	8-16	8G	2-4
Large (50+ samples, 500+ MAGs)	16-32	16-32G	1-2

resources:
  threads_per_job: 16
  mem_per_job: "8G"
  time: "24:00:00"

Using SLURM profiles for HPC¶

For cluster environments, use a Snakemake SLURM profile:

# Create a SLURM profile
mkdir -p ~/.config/snakemake/slurm

# Run with the profile
alleleflux run --config config.yml -- --profile slurm

Or use the cluster-generic executor plugin:

alleleflux run --config config.yml -- \
    --executor cluster-generic \
    --cluster-generic-submit-cmd "sbatch --mem={resources.mem_per_job} --cpus-per-task={threads} --time={resources.time}"

Common Warnings¶

“NaN p-values” in statistical test output¶

Cause: Insufficient data at some genomic positions to compute a valid test statistic. This is normal and expected for sparse metagenomic data.

When to worry:

If a small fraction of positions have NaN p-values, this is typical – sparse coverage or zero-variance positions
If most positions have NaN p-values, check that samples have adequate coverage and that QC is not too aggressive

Action: NaN p-values are automatically excluded from score calculations. No intervention is needed unless the proportion is unexpectedly high.

“0 eligible MAGs” for a test type¶

Cause: No MAGs pass the eligibility criteria for a particular test. Common reasons:

Too few samples passing QC per group (below min_sample_num)
Paired tests require matching samples across timepoints – missing timepoints disqualify MAGs
Low genome coverage in most samples

Action:

# Relax QC thresholds
quality_control:
  min_sample_num: 3           # Lower minimum
  breadth_threshold: 0.05     # Lower coverage requirement

Check which MAGs are close to eligibility:

column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv | head -20

“Reference base mismatch” in dN/dS or profile files¶

Cause: The reference base in profile data does not match the FASTA file. This typically means different reference FASTA files were used for alignment (BAM generation) and for AlleleFlux analysis.

Action:

Ensure all BAM files were aligned to the same combined reference FASTA specified in input.fasta_path
If BAMs were aligned to different references, re-align to the combined reference
Verify with: samtools view -H sample.bam | grep "^@SQ" | head and compare contig names to your reference FASTA

FAQ & Troubleshooting¶

General Questions¶

What is AlleleFlux?¶

What types of data does AlleleFlux work with?¶

What is the difference between “single” and “longitudinal” data types?¶

How does AlleleFlux differ from other tools like SNPGenie, PAML, and HyPhy?¶

What are MAGs and why does AlleleFlux use them?¶

How many samples or replicates do I need?¶

Analysis Questions¶

What statistical test should I use for my study design?¶

What do parallelism and divergence scores mean?¶

How is the CMH score different from other scores?¶

What does the focus timepoint mean?¶

How is dN/dS calculated?¶

Can I run only part of the pipeline?¶

Input / Output Questions¶

What FASTA header format is required?¶

Can I use MAGs from different assemblers?¶

What BAM quality is needed?¶

Can I reuse profile files from a previous run?¶

Where are the main results files?¶

Troubleshooting¶

Installation Issues¶

rpy2 fails to install¶

pysam compilation error¶

Snakemake version conflict¶

Runtime Errors¶

“Directory is locked”¶

“No MAGs eligible” for a test¶

Memory errors (out of memory / killed)¶

R / rpy2 errors during CMH tests¶

“Missing input files” in Snakemake¶

Empty output files¶

Performance Tips¶

Reuse profiles for re-analysis¶

Optimal thread and memory settings¶

Using SLURM profiles for HPC¶

Common Warnings¶

“NaN p-values” in statistical test output¶

“0 eligible MAGs” for a test type¶

“Reference base mismatch” in dN/dS or profile files¶

See Also¶