FAQ & Troubleshooting¶
Frequently asked questions and solutions to common issues when using AlleleFlux.
General Questions¶
What is AlleleFlux?¶
AlleleFlux is an open-source bioinformatics toolkit for analyzing allele frequency changes in metagenomic data. It identifies genomic targets of natural selection in microbial communities by detecting parallel evolution through:
Parallelism and divergence scores across MAGs, genes, and taxa
Multiple statistical tests (t-tests, Mann-Whitney, LMM, CMH)
dN/dS ratios using the Nei-Gojobori (1986) method with path averaging
Outlier gene detection to pinpoint adaptive candidates
The workflow is implemented as a two-step Snakemake pipeline: Step 1 profiles samples and performs QC, while Step 2 runs statistical analyses and scoring.
What types of data does AlleleFlux work with?¶
AlleleFlux works with shotgun metagenomic data aligned to metagenome-assembled genomes (MAGs). Required inputs include:
BAM files – reads aligned to a combined MAG reference (sorted and indexed)
Reference FASTA – combined MAG contigs with
<MAG_ID>.fa_<contig_ID>headersProdigal gene predictions – nucleotide FASTA (
.fna) for dN/dS and gene-level analysisSample metadata – TSV with sample IDs, BAM paths, group assignments, and (for longitudinal data) timepoints
AlleleFlux supports both longitudinal (multiple timepoints per subject) and single-timepoint (cross-sectional) study designs.
See also
Input Files Reference for detailed format specifications.
What is the difference between “single” and “longitudinal” data types?¶
The data_type setting in your configuration controls which analyses are available:
Feature |
|
|
|---|---|---|
Timepoints |
One per subject |
Multiple per subject |
Paired tests |
Not available |
Paired t-test, Wilcoxon signed-rank |
CMH test |
Not available |
Available (stratified by replicate) |
LMM |
Not available |
Available (random effects for subjects) |
dN/dS |
Not applicable |
Compares ancestral vs. derived timepoints |
Focus timepoint |
Not applicable |
Required – determines evolutionary direction |
Set data_type: single for cross-sectional studies where each subject is sampled once. Set data_type: longitudinal for time-series studies where the same subjects are sampled at multiple timepoints.
How does AlleleFlux differ from other tools like SNPGenie, PAML, and HyPhy?¶
Tool |
Scope |
Approach |
Best For |
|---|---|---|---|
AlleleFlux |
Metagenomic populations |
Allele frequency changes across samples/timepoints |
Parallel evolution in microbiomes |
SNPGenie |
Within-population |
Polymorphism-based dN/dS from VCF |
Single-population diversity |
PAML |
Phylogenetic lineages |
Codon models on sequence alignments |
Divergence between species |
HyPhy |
Phylogenetic branches |
Hypothesis testing on phylogenies |
Branch-specific selection |
AlleleFlux is specifically designed for metagenomic data with replicated experimental designs. It works with raw allele frequencies from BAM pileups (not consensus sequences or VCF files), enabling detection of subtle frequency shifts below fixation. Its statistical framework (parallelism/divergence scores, CMH stratification, LMM) is tailored for replicated microbiome experiments.
What are MAGs and why does AlleleFlux use them?¶
Metagenome-Assembled Genomes (MAGs) are draft genomes reconstructed from metagenomic sequencing data by binning contigs that likely originate from the same organism. AlleleFlux uses MAGs because:
Population-level resolution – MAGs represent microbial populations rather than individual strains, making allele frequency analysis meaningful
Reference for alignment – BAM files are generated by aligning reads to MAG contigs, enabling position-level allele counting
Biological unit of analysis – Scores are calculated per MAG, representing evolutionary changes in a specific microbial population
Taxonomic context – MAGs can be classified with GTDB-Tk, allowing aggregation of scores at taxonomic levels
AlleleFlux supports MAGs from any assembler (MEGAHIT, metaSPAdes, etc.) as long as they follow the required header format.
How many samples or replicates do I need?¶
A minimum of 4 samples per group is required by default (configurable via quality_control.min_sample_num). More replicates improve statistical power:
Replicates per Group |
Power Level |
Notes |
|---|---|---|
4-6 |
Minimum |
Basic detection of strong effects |
8-12 |
Good |
Recommended for most studies |
>12 |
Excellent |
Detects subtle allele frequency shifts |
For longitudinal studies, each replicate/subject must have samples at all analyzed timepoints. For CMH tests, stratification by replicate requires at least 2-3 replicates per group.
Tip
If you have fewer than 4 samples per group, lower min_sample_num in the config – but be aware that statistical power will be reduced and false positives may increase.
Analysis Questions¶
What statistical test should I use for my study design?¶
The choice depends on your experimental design. Use the decision flowchart in the Statistical Tests Reference:
Study Design |
Recommended Tests |
|---|---|
Cross-sectional, independent groups |
Two-Sample Unpaired |
Longitudinal, same subjects over time |
Two-Sample Paired + LMM + CMH |
Longitudinal, want to control for subject variation |
LMM (best for repeated measures) |
Longitudinal, want directional change detection |
CMH (differential significance) |
Within-group temporal changes |
Single-Sample, LMM Across Time, CMH Across Time |
For comprehensive analysis, enable all available tests:
analysis:
use_significance_tests: true
use_lmm: true
use_cmh: true
Genes significant across multiple tests are the most robust candidates for selection.
See also
Statistical Tests Reference for detailed descriptions of each test and the decision flowchart.
What do parallelism and divergence scores mean?¶
Parallelism score: The percentage of genomic positions showing statistically significant allele frequency changes within a group. High parallelism indicates deterministic evolution – the same sites are changing consistently across replicates, suggesting natural selection rather than genetic drift.
Divergence score: The percentage of genomic positions showing statistically significant differences in allele frequencies between groups. High divergence indicates differential selection between experimental conditions.
Both scores are expressed as percentages (0-100%) and are calculated at MAG, gene, and taxonomic levels.
Example interpretation: A gene with 15% parallelism score and 20% divergence score has 15% of its positions evolving consistently within a group and 20% of its positions evolving differently between groups.
How is the CMH score different from other scores?¶
Most test scores use the simple formula:
Score (%) = (Significant Sites / Total Sites) x 100
The CMH score instead uses differential significance:
Score (%) = (Sites significant at focus timepoint BUT NOT at the other timepoint)
/ (Total sites present in BOTH timepoints) x 100
This captures directional change – sites where allele frequencies became significantly different over time – rather than just overall significance. A high CMH score means many sites gained significance specifically at the focus (later) timepoint, indicating active evolutionary change.
See also
Statistical Tests Reference – CMH Score Calculation for the full mathematical formula.
What does the focus timepoint mean?¶
The focus timepoint represents the derived or later state in evolutionary comparisons:
For dN/dS: Focus = derived timepoint; the other = ancestral. AlleleFlux calculates changes in the direction ancestral -> derived.
For CMH scores: Measures sites that became significant at the focus timepoint but were not significant at the other timepoint.
Rule of thumb: Always set
focusto the later timepoint.
timepoints_combinations:
- timepoint: ["pre", "post"]
focus: "post" # "post" is the derived state
- timepoint: ["day0", "day30"]
focus: "day30" # "day30" is the endpoint
If not specified, AlleleFlux defaults to the second timepoint in the list.
See also
Configuration Reference – Focus Timepoint for detailed configuration examples.
How is dN/dS calculated?¶
AlleleFlux uses the Nei-Gojobori (1986) method with path averaging:
Potential sites: For each codon, calculate expected synonymous (S) and non-synonymous (N) sites based on the genetic code
Observed changes: At statistically significant positions, compare ancestral and derived codons
Path averaging: When multiple positions change in a codon (k=2 or k=3), enumerate all mutational pathways and average the S/NS classification
Ratios: Calculate
pS = observed_S / potential_S,pN = observed_N / potential_N, thendN/dS = pN / pS
NaN values occur when pS = 0 (no synonymous substitutions observed), which is common for short genes or closely related timepoints.
See also
dN/dS Analysis Guide for the full methodology, usage examples, and interpretation guide.
Can I run only part of the pipeline?¶
Yes. There are three approaches:
1. Use allele_analysis_only to skip statistical tests:
analysis:
allele_analysis_only: true # Only runs profiling + allele frequency analysis
2. Selectively enable/disable test modules:
analysis:
use_significance_tests: true # Two-sample and single-sample tests
use_lmm: false # Disable LMM
use_cmh: false # Disable CMH
3. Run individual CLI tools standalone:
Every analysis step is available as a separate command:
alleleflux-profile --help # Profiling only
alleleflux-qc --help # Quality control only
alleleflux-scores --help # Score calculation only
alleleflux-dnds-from-timepoints --help # dN/dS only
See also
CLI Reference for the complete list of standalone tools.
Input / Output Questions¶
What FASTA header format is required?¶
The combined reference FASTA must use the format <MAG_ID>.fa_<contig_ID>:
>Bacteroides_001.fa_k141_12345
ATGCGATCGATCG...
>Bacteroides_001.fa_k141_67890
GCTAGCTAGCTAG...
>Lachnospira_002.fa_scaffold_1
ATCGATCGATCGA...
Use alleleflux-create-mag-mapping to combine individual MAG FASTAs and automatically generate correct headers and the mapping file:
alleleflux-create-mag-mapping \
--dir mag_fastas/ --extension fa \
--output-fasta combined.fasta --output-mapping mapping.tsv
See also
Input Files Reference for all format specifications.
Can I use MAGs from different assemblers?¶
Yes. AlleleFlux is assembler-agnostic. MAGs from MEGAHIT, metaSPAdes, Flye, or any other assembler work as long as:
Individual MAG FASTA files are provided in a single directory
You run
alleleflux-create-mag-mappingto create the combined reference with standardized headersReads are aligned to this combined reference (not the original per-assembler contigs)
The assembler-specific contig names are preserved after the <MAG_ID>.fa_ prefix.
What BAM quality is needed?¶
BAM files must be:
Sorted by coordinate (
samtools sort)Indexed with a
.baifile (samtools index)Aligned to the combined MAG reference FASTA (the same file specified as
fasta_pathin config)
AlleleFlux applies quality filters during profiling:
Parameter |
Default |
Effect |
|---|---|---|
|
30 |
Minimum Phred score per base |
|
2 |
Minimum MAPQ per read |
|
true |
Exclude unpaired reads |
|
true |
Avoid double-counting overlapping pairs |
# Prepare BAM files
samtools sort input.bam -o input.sorted.bam
samtools index input.sorted.bam
Tip
Keep min_base_quality at 30 for standard Illumina data. Lower to 20 for older sequencing platforms or if coverage is a concern.
Can I reuse profile files from a previous run?¶
Yes. Profile files (profiles/{sample}/{sample}_{mag}_profiled.tsv.gz) are the most computationally expensive output. Snakemake automatically detects existing profile files and skips re-profiling when you re-run the pipeline.
To re-analyze with different statistical parameters (e.g., different p-value thresholds or group comparisons) without re-profiling:
Keep the same
output.root_dirin your configModify only the
analysis,statistics, orquality_controlsectionsRe-run the pipeline – Snakemake will reuse profiles and only recompute downstream steps
# Edit config with new parameters, then re-run
alleleflux run --config config_v2.yml --threads 16
Where are the main results files?¶
Key output locations under {root_dir}/{data_type}/:
Result |
Path |
|---|---|
MAG-level scores |
|
Gene-level scores |
|
Outlier genes |
|
Statistical test results |
|
Eligibility tables |
|
QC metrics |
|
dN/dS results |
|
See also
Output Files Reference for complete column descriptions and file format details.
Troubleshooting¶
Installation Issues¶
rpy2 fails to install¶
Symptom:
rpy2.rinterface_lib.openrlib.ffi.error: R_HOME not found
or compilation errors during pip install rpy2.
Solution: Install via conda, which handles R and rpy2 together:
conda install -c conda-forge rpy2 r-base r-tidyr
If using pip, ensure R_HOME is set:
export R_HOME=$(R RHOME)
pip install rpy2
pysam compilation error¶
Symptom:
fatal error: htslib/sam.h: No such file or directory
Solution: Install pysam from conda-forge instead of building from source:
conda install -c conda-forge pysam
Alternatively, install htslib and samtools system-wide before running pip install pysam.
Snakemake version conflict¶
Symptom: Errors about missing Snakemake features or plugin architecture.
Solution: AlleleFlux requires Snakemake >= 8.0.0:
# Check current version
snakemake --version
# Upgrade
conda install -c conda-forge -c bioconda 'snakemake>=8.0.0'
Snakemake 8.x uses a plugin-based executor system. For cluster execution, also install:
conda install -c conda-forge snakemake-executor-plugin-cluster-generic
See also
Installation for complete setup instructions.
Runtime Errors¶
“Directory is locked”¶
Symptom:
Error: Directory cannot be locked. ...
This occurs when a previous Snakemake run crashed or was interrupted.
Solution:
# Unlock the working directory
alleleflux run --config config.yml --unlock
# Then re-run normally
alleleflux run --config config.yml --threads 16
Or directly with Snakemake:
snakemake -s step1.smk --configfile config.yml --unlock
“No MAGs eligible” for a test¶
Symptom: The eligibility table shows all MAGs as ineligible for one or more test types, and downstream rules produce no output.
Causes and solutions:
breadth_thresholdtoo high – Lower from default0.1to0.05:quality_control: breadth_threshold: 0.05
min_sample_numtoo high – Lower the minimum required samples per group:quality_control: min_sample_num: 3
Incorrect BAM paths – Verify all paths in your metadata file exist:
awk -F'\t' 'NR>1 {print $2}' metadata.tsv | xargs -I{} test -f {} || echo "MISSING: {}"
Low sequencing depth – Samples may not have sufficient coverage of MAG contigs
Inspect the eligibility table directly:
column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv
Memory errors (out of memory / killed)¶
Symptom: Jobs are killed by the OS or SLURM with MemoryError or SIGKILL.
Solutions:
Increase per-job memory in config:
resources: mem_per_job: "16G" # Increase from default 8G
Reduce parallelism to lower total memory usage:
alleleflux run --config config.yml --threads 4 --jobs 2
Memory estimates by dataset size:
MAGs
Samples
Recommended Memory
< 100
< 20
8 GB
100-500
20-50
16-32 GB
500+
50+
32-64 GB
R / rpy2 errors during CMH tests¶
Symptom:
RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr'
or other R-related errors during CMH rule execution.
Solution: Ensure R and the tidyr package are installed and accessible:
# Check R is available
R --version
# Check tidyr is installed
R -e "library(tidyr)"
# Install if missing
R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"
# Or via conda (recommended)
conda install -c conda-forge r-tidyr
If R is installed but not found by rpy2, set R_HOME:
export R_HOME=$(R RHOME)
“Missing input files” in Snakemake¶
Symptom:
MissingInputException in rule ...:
Missing input files for rule ...:
Causes and solutions:
Incorrect paths in config – Verify all paths are absolute or correct relative to the working directory:
input: fasta_path: /absolute/path/to/combined.fasta # Use absolute paths metadata_path: /absolute/path/to/metadata.tsv
Missing upstream outputs – A previous step may have failed silently. Check logs:
ls -la output/longitudinal/profiles/ # Are profile files present?
File permission issues – Ensure read access to all input files and write access to the output directory.
Empty output files¶
Symptom: Output files exist but contain only headers or are 0 bytes.
Causes and solutions:
QC thresholds too strict – Too many samples/positions filtered out:
quality_control: breadth_threshold: 0.05 # Lower from 0.1 min_sample_num: 3 # Lower from 4
Preprocessing filters too aggressive – Check
max_zero_countand preprocessing settings:statistics: max_zero_count: 6 # Increase to allow more zeros preprocess_between_groups: false # Temporarily disable to test
Check log files for warnings:
# Snakemake logs ls output/longitudinal/.snakemake/log/ # Or check stderr of specific rules grep -r "WARNING\|ERROR" output/longitudinal/.snakemake/log/
Performance Tips¶
Reuse profiles for re-analysis¶
Profile generation (Step 1) is the most time-consuming step. To re-analyze with different parameters without re-profiling:
Keep the same
output.root_dirChange only
analysis,statistics, orquality_controlsettingsRe-run – Snakemake automatically detects existing profiles and skips them
This is particularly useful for:
Testing different p-value thresholds
Adding or removing group comparisons
Adjusting QC parameters
Optimal thread and memory settings¶
Dataset Size |
Threads |
Memory per Job |
Concurrent Jobs |
|---|---|---|---|
Small (<20 samples, <100 MAGs) |
4-8 |
4G |
4 |
Medium (20-50 samples, 100-500 MAGs) |
8-16 |
8G |
2-4 |
Large (50+ samples, 500+ MAGs) |
16-32 |
16-32G |
1-2 |
resources:
threads_per_job: 16
mem_per_job: "8G"
time: "24:00:00"
Using SLURM profiles for HPC¶
For cluster environments, use a Snakemake SLURM profile:
# Create a SLURM profile
mkdir -p ~/.config/snakemake/slurm
# Run with the profile
alleleflux run --config config.yml -- --profile slurm
Or use the cluster-generic executor plugin:
alleleflux run --config config.yml -- \
--executor cluster-generic \
--cluster-generic-submit-cmd "sbatch --mem={resources.mem_per_job} --cpus-per-task={threads} --time={resources.time}"
Common Warnings¶
“NaN p-values” in statistical test output¶
Cause: Insufficient data at some genomic positions to compute a valid test statistic. This is normal and expected for sparse metagenomic data.
When to worry:
If a small fraction of positions have NaN p-values, this is typical – sparse coverage or zero-variance positions
If most positions have NaN p-values, check that samples have adequate coverage and that QC is not too aggressive
Action: NaN p-values are automatically excluded from score calculations. No intervention is needed unless the proportion is unexpectedly high.
“0 eligible MAGs” for a test type¶
Cause: No MAGs pass the eligibility criteria for a particular test. Common reasons:
Too few samples passing QC per group (below
min_sample_num)Paired tests require matching samples across timepoints – missing timepoints disqualify MAGs
Low genome coverage in most samples
Action:
# Relax QC thresholds
quality_control:
min_sample_num: 3 # Lower minimum
breadth_threshold: 0.05 # Lower coverage requirement
Check which MAGs are close to eligibility:
column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv | head -20
“Reference base mismatch” in dN/dS or profile files¶
Cause: The reference base in profile data does not match the FASTA file. This typically means different reference FASTA files were used for alignment (BAM generation) and for AlleleFlux analysis.
Action:
Ensure all BAM files were aligned to the same combined reference FASTA specified in
input.fasta_pathIf BAMs were aligned to different references, re-align to the combined reference
Verify with:
samtools view -H sample.bam | grep "^@SQ" | headand compare contig names to your reference FASTA
See Also¶
Installation – Setup and dependency management
Configuration Reference – Full config options and examples
CLI Reference – All available command-line tools
Statistical Tests Reference – Test descriptions and decision flowchart
Output Files Reference – Output file formats and locations
Interpreting Results – How to interpret scores and findings
dN/dS Analysis Guide – Evolutionary rate analysis