FAQ & Troubleshooting

Frequently asked questions and solutions to common issues when using AlleleFlux.


General Questions

What is AlleleFlux?

AlleleFlux is an open-source bioinformatics toolkit for analyzing allele frequency changes in metagenomic data. It identifies genomic targets of natural selection in microbial communities by detecting parallel evolution through:

  • Parallelism and divergence scores across MAGs, genes, and taxa

  • Multiple statistical tests (t-tests, Mann-Whitney, LMM, CMH)

  • dN/dS ratios using the Nei-Gojobori (1986) method with path averaging

  • Outlier gene detection to pinpoint adaptive candidates

The workflow is implemented as a two-step Snakemake pipeline: Step 1 profiles samples and performs QC, while Step 2 runs statistical analyses and scoring.

What types of data does AlleleFlux work with?

AlleleFlux works with shotgun metagenomic data aligned to metagenome-assembled genomes (MAGs). Required inputs include:

  • BAM files – reads aligned to a combined MAG reference (sorted and indexed)

  • Reference FASTA – combined MAG contigs with <MAG_ID>.fa_<contig_ID> headers

  • Prodigal gene predictions – nucleotide FASTA (.fna) for dN/dS and gene-level analysis

  • Sample metadata – TSV with sample IDs, BAM paths, group assignments, and (for longitudinal data) timepoints

AlleleFlux supports both longitudinal (multiple timepoints per subject) and single-timepoint (cross-sectional) study designs.

See also

Input Files Reference for detailed format specifications.

What is the difference between “single” and “longitudinal” data types?

The data_type setting in your configuration controls which analyses are available:

Feature

single

longitudinal

Timepoints

One per subject

Multiple per subject

Paired tests

Not available

Paired t-test, Wilcoxon signed-rank

CMH test

Not available

Available (stratified by replicate)

LMM

Not available

Available (random effects for subjects)

dN/dS

Not applicable

Compares ancestral vs. derived timepoints

Focus timepoint

Not applicable

Required – determines evolutionary direction

Set data_type: single for cross-sectional studies where each subject is sampled once. Set data_type: longitudinal for time-series studies where the same subjects are sampled at multiple timepoints.

How does AlleleFlux differ from other tools like SNPGenie, PAML, and HyPhy?

Tool

Scope

Approach

Best For

AlleleFlux

Metagenomic populations

Allele frequency changes across samples/timepoints

Parallel evolution in microbiomes

SNPGenie

Within-population

Polymorphism-based dN/dS from VCF

Single-population diversity

PAML

Phylogenetic lineages

Codon models on sequence alignments

Divergence between species

HyPhy

Phylogenetic branches

Hypothesis testing on phylogenies

Branch-specific selection

AlleleFlux is specifically designed for metagenomic data with replicated experimental designs. It works with raw allele frequencies from BAM pileups (not consensus sequences or VCF files), enabling detection of subtle frequency shifts below fixation. Its statistical framework (parallelism/divergence scores, CMH stratification, LMM) is tailored for replicated microbiome experiments.

What are MAGs and why does AlleleFlux use them?

Metagenome-Assembled Genomes (MAGs) are draft genomes reconstructed from metagenomic sequencing data by binning contigs that likely originate from the same organism. AlleleFlux uses MAGs because:

  1. Population-level resolution – MAGs represent microbial populations rather than individual strains, making allele frequency analysis meaningful

  2. Reference for alignment – BAM files are generated by aligning reads to MAG contigs, enabling position-level allele counting

  3. Biological unit of analysis – Scores are calculated per MAG, representing evolutionary changes in a specific microbial population

  4. Taxonomic context – MAGs can be classified with GTDB-Tk, allowing aggregation of scores at taxonomic levels

AlleleFlux supports MAGs from any assembler (MEGAHIT, metaSPAdes, etc.) as long as they follow the required header format.

How many samples or replicates do I need?

A minimum of 4 samples per group is required by default (configurable via quality_control.min_sample_num). More replicates improve statistical power:

Replicates per Group

Power Level

Notes

4-6

Minimum

Basic detection of strong effects

8-12

Good

Recommended for most studies

>12

Excellent

Detects subtle allele frequency shifts

For longitudinal studies, each replicate/subject must have samples at all analyzed timepoints. For CMH tests, stratification by replicate requires at least 2-3 replicates per group.

Tip

If you have fewer than 4 samples per group, lower min_sample_num in the config – but be aware that statistical power will be reduced and false positives may increase.


Analysis Questions

What statistical test should I use for my study design?

The choice depends on your experimental design. Use the decision flowchart in the Statistical Tests Reference:

Study Design

Recommended Tests

Cross-sectional, independent groups

Two-Sample Unpaired

Longitudinal, same subjects over time

Two-Sample Paired + LMM + CMH

Longitudinal, want to control for subject variation

LMM (best for repeated measures)

Longitudinal, want directional change detection

CMH (differential significance)

Within-group temporal changes

Single-Sample, LMM Across Time, CMH Across Time

For comprehensive analysis, enable all available tests:

analysis:
  use_significance_tests: true
  use_lmm: true
  use_cmh: true

Genes significant across multiple tests are the most robust candidates for selection.

See also

Statistical Tests Reference for detailed descriptions of each test and the decision flowchart.

What do parallelism and divergence scores mean?

  • Parallelism score: The percentage of genomic positions showing statistically significant allele frequency changes within a group. High parallelism indicates deterministic evolution – the same sites are changing consistently across replicates, suggesting natural selection rather than genetic drift.

  • Divergence score: The percentage of genomic positions showing statistically significant differences in allele frequencies between groups. High divergence indicates differential selection between experimental conditions.

Both scores are expressed as percentages (0-100%) and are calculated at MAG, gene, and taxonomic levels.

Example interpretation: A gene with 15% parallelism score and 20% divergence score has 15% of its positions evolving consistently within a group and 20% of its positions evolving differently between groups.

How is the CMH score different from other scores?

Most test scores use the simple formula:

Score (%) = (Significant Sites / Total Sites) x 100

The CMH score instead uses differential significance:

Score (%) = (Sites significant at focus timepoint BUT NOT at the other timepoint)
            / (Total sites present in BOTH timepoints) x 100

This captures directional change – sites where allele frequencies became significantly different over time – rather than just overall significance. A high CMH score means many sites gained significance specifically at the focus (later) timepoint, indicating active evolutionary change.

See also

Statistical Tests Reference – CMH Score Calculation for the full mathematical formula.

What does the focus timepoint mean?

The focus timepoint represents the derived or later state in evolutionary comparisons:

  • For dN/dS: Focus = derived timepoint; the other = ancestral. AlleleFlux calculates changes in the direction ancestral -> derived.

  • For CMH scores: Measures sites that became significant at the focus timepoint but were not significant at the other timepoint.

  • Rule of thumb: Always set focus to the later timepoint.

timepoints_combinations:
  - timepoint: ["pre", "post"]
    focus: "post"       # "post" is the derived state
  - timepoint: ["day0", "day30"]
    focus: "day30"      # "day30" is the endpoint

If not specified, AlleleFlux defaults to the second timepoint in the list.

See also

Configuration Reference – Focus Timepoint for detailed configuration examples.

How is dN/dS calculated?

AlleleFlux uses the Nei-Gojobori (1986) method with path averaging:

  1. Potential sites: For each codon, calculate expected synonymous (S) and non-synonymous (N) sites based on the genetic code

  2. Observed changes: At statistically significant positions, compare ancestral and derived codons

  3. Path averaging: When multiple positions change in a codon (k=2 or k=3), enumerate all mutational pathways and average the S/NS classification

  4. Ratios: Calculate pS = observed_S / potential_S, pN = observed_N / potential_N, then dN/dS = pN / pS

NaN values occur when pS = 0 (no synonymous substitutions observed), which is common for short genes or closely related timepoints.

See also

dN/dS Analysis Guide for the full methodology, usage examples, and interpretation guide.

Can I run only part of the pipeline?

Yes. There are three approaches:

1. Use allele_analysis_only to skip statistical tests:

analysis:
  allele_analysis_only: true   # Only runs profiling + allele frequency analysis

2. Selectively enable/disable test modules:

analysis:
  use_significance_tests: true   # Two-sample and single-sample tests
  use_lmm: false                 # Disable LMM
  use_cmh: false                 # Disable CMH

3. Run individual CLI tools standalone:

Every analysis step is available as a separate command:

alleleflux-profile --help        # Profiling only
alleleflux-qc --help             # Quality control only
alleleflux-scores --help          # Score calculation only
alleleflux-dnds-from-timepoints --help  # dN/dS only

See also

CLI Reference for the complete list of standalone tools.


Input / Output Questions

What FASTA header format is required?

The combined reference FASTA must use the format <MAG_ID>.fa_<contig_ID>:

>Bacteroides_001.fa_k141_12345
ATGCGATCGATCG...
>Bacteroides_001.fa_k141_67890
GCTAGCTAGCTAG...
>Lachnospira_002.fa_scaffold_1
ATCGATCGATCGA...

Use alleleflux-create-mag-mapping to combine individual MAG FASTAs and automatically generate correct headers and the mapping file:

alleleflux-create-mag-mapping \
    --dir mag_fastas/ --extension fa \
    --output-fasta combined.fasta --output-mapping mapping.tsv

See also

Input Files Reference for all format specifications.

Can I use MAGs from different assemblers?

Yes. AlleleFlux is assembler-agnostic. MAGs from MEGAHIT, metaSPAdes, Flye, or any other assembler work as long as:

  1. Individual MAG FASTA files are provided in a single directory

  2. You run alleleflux-create-mag-mapping to create the combined reference with standardized headers

  3. Reads are aligned to this combined reference (not the original per-assembler contigs)

The assembler-specific contig names are preserved after the <MAG_ID>.fa_ prefix.

What BAM quality is needed?

BAM files must be:

  • Sorted by coordinate (samtools sort)

  • Indexed with a .bai file (samtools index)

  • Aligned to the combined MAG reference FASTA (the same file specified as fasta_path in config)

AlleleFlux applies quality filters during profiling:

Parameter

Default

Effect

min_base_quality

30

Minimum Phred score per base

min_mapping_quality

2

Minimum MAPQ per read

ignore_orphans

true

Exclude unpaired reads

ignore_overlaps

true

Avoid double-counting overlapping pairs

# Prepare BAM files
samtools sort input.bam -o input.sorted.bam
samtools index input.sorted.bam

Tip

Keep min_base_quality at 30 for standard Illumina data. Lower to 20 for older sequencing platforms or if coverage is a concern.

Can I reuse profile files from a previous run?

Yes. Profile files (profiles/{sample}/{sample}_{mag}_profiled.tsv.gz) are the most computationally expensive output. Snakemake automatically detects existing profile files and skips re-profiling when you re-run the pipeline.

To re-analyze with different statistical parameters (e.g., different p-value thresholds or group comparisons) without re-profiling:

  1. Keep the same output.root_dir in your config

  2. Modify only the analysis, statistics, or quality_control sections

  3. Re-run the pipeline – Snakemake will reuse profiles and only recompute downstream steps

# Edit config with new parameters, then re-run
alleleflux run --config config_v2.yml --threads 16

Where are the main results files?

Key output locations under {root_dir}/{data_type}/:

Result

Path

MAG-level scores

scores/processed/combined/MAG/scores_{test}-{timepoints}-{groups}*.tsv

Gene-level scores

scores/processed/gene_scores_{timepoints}-{groups}/{test}_gene_scores.tsv.gz

Outlier genes

outlier_genes/{timepoints}-{groups}/{test}_outlier_genes.tsv.gz

Statistical test results

significance_tests/{test}_{timepoints}-{groups}/{mag}_*.tsv.gz

Eligibility tables

eligibility_table_{timepoints}-{groups}.tsv

QC metrics

QC/QC_{timepoints}-{groups}/{mag}_qc.tsv

dN/dS results

dnds/{timepoints}-{groups}/{mag}_gene_summary_ng86.tsv.gz

See also

Output Files Reference for complete column descriptions and file format details.


Troubleshooting

Installation Issues

rpy2 fails to install

Symptom:

rpy2.rinterface_lib.openrlib.ffi.error: R_HOME not found

or compilation errors during pip install rpy2.

Solution: Install via conda, which handles R and rpy2 together:

conda install -c conda-forge rpy2 r-base r-tidyr

If using pip, ensure R_HOME is set:

export R_HOME=$(R RHOME)
pip install rpy2

pysam compilation error

Symptom:

fatal error: htslib/sam.h: No such file or directory

Solution: Install pysam from conda-forge instead of building from source:

conda install -c conda-forge pysam

Alternatively, install htslib and samtools system-wide before running pip install pysam.

Snakemake version conflict

Symptom: Errors about missing Snakemake features or plugin architecture.

Solution: AlleleFlux requires Snakemake >= 8.0.0:

# Check current version
snakemake --version

# Upgrade
conda install -c conda-forge -c bioconda 'snakemake>=8.0.0'

Snakemake 8.x uses a plugin-based executor system. For cluster execution, also install:

conda install -c conda-forge snakemake-executor-plugin-cluster-generic

See also

Installation for complete setup instructions.


Runtime Errors

“Directory is locked”

Symptom:

Error: Directory cannot be locked. ...

This occurs when a previous Snakemake run crashed or was interrupted.

Solution:

# Unlock the working directory
alleleflux run --config config.yml --unlock

# Then re-run normally
alleleflux run --config config.yml --threads 16

Or directly with Snakemake:

snakemake -s step1.smk --configfile config.yml --unlock

“No MAGs eligible” for a test

Symptom: The eligibility table shows all MAGs as ineligible for one or more test types, and downstream rules produce no output.

Causes and solutions:

  1. breadth_threshold too high – Lower from default 0.1 to 0.05:

    quality_control:
      breadth_threshold: 0.05
    
  2. min_sample_num too high – Lower the minimum required samples per group:

    quality_control:
      min_sample_num: 3
    
  3. Incorrect BAM paths – Verify all paths in your metadata file exist:

    awk -F'\t' 'NR>1 {print $2}' metadata.tsv | xargs -I{} test -f {} || echo "MISSING: {}"
    
  4. Low sequencing depth – Samples may not have sufficient coverage of MAG contigs

Inspect the eligibility table directly:

column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv

Memory errors (out of memory / killed)

Symptom: Jobs are killed by the OS or SLURM with MemoryError or SIGKILL.

Solutions:

  1. Increase per-job memory in config:

    resources:
      mem_per_job: "16G"    # Increase from default 8G
    
  2. Reduce parallelism to lower total memory usage:

    alleleflux run --config config.yml --threads 4 --jobs 2
    
  3. Memory estimates by dataset size:

    MAGs

    Samples

    Recommended Memory

    < 100

    < 20

    8 GB

    100-500

    20-50

    16-32 GB

    500+

    50+

    32-64 GB

R / rpy2 errors during CMH tests

Symptom:

RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr'

or other R-related errors during CMH rule execution.

Solution: Ensure R and the tidyr package are installed and accessible:

# Check R is available
R --version

# Check tidyr is installed
R -e "library(tidyr)"

# Install if missing
R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"

# Or via conda (recommended)
conda install -c conda-forge r-tidyr

If R is installed but not found by rpy2, set R_HOME:

export R_HOME=$(R RHOME)

“Missing input files” in Snakemake

Symptom:

MissingInputException in rule ...:
Missing input files for rule ...:

Causes and solutions:

  1. Incorrect paths in config – Verify all paths are absolute or correct relative to the working directory:

    input:
      fasta_path: /absolute/path/to/combined.fasta    # Use absolute paths
      metadata_path: /absolute/path/to/metadata.tsv
    
  2. Missing upstream outputs – A previous step may have failed silently. Check logs:

    ls -la output/longitudinal/profiles/   # Are profile files present?
    
  3. File permission issues – Ensure read access to all input files and write access to the output directory.

Empty output files

Symptom: Output files exist but contain only headers or are 0 bytes.

Causes and solutions:

  1. QC thresholds too strict – Too many samples/positions filtered out:

    quality_control:
      breadth_threshold: 0.05      # Lower from 0.1
      min_sample_num: 3            # Lower from 4
    
  2. Preprocessing filters too aggressive – Check max_zero_count and preprocessing settings:

    statistics:
      max_zero_count: 6            # Increase to allow more zeros
      preprocess_between_groups: false   # Temporarily disable to test
    
  3. Check log files for warnings:

    # Snakemake logs
    ls output/longitudinal/.snakemake/log/
    
    # Or check stderr of specific rules
    grep -r "WARNING\|ERROR" output/longitudinal/.snakemake/log/
    

Performance Tips

Reuse profiles for re-analysis

Profile generation (Step 1) is the most time-consuming step. To re-analyze with different parameters without re-profiling:

  1. Keep the same output.root_dir

  2. Change only analysis, statistics, or quality_control settings

  3. Re-run – Snakemake automatically detects existing profiles and skips them

This is particularly useful for:

  • Testing different p-value thresholds

  • Adding or removing group comparisons

  • Adjusting QC parameters

Optimal thread and memory settings

Dataset Size

Threads

Memory per Job

Concurrent Jobs

Small (<20 samples, <100 MAGs)

4-8

4G

4

Medium (20-50 samples, 100-500 MAGs)

8-16

8G

2-4

Large (50+ samples, 500+ MAGs)

16-32

16-32G

1-2

resources:
  threads_per_job: 16
  mem_per_job: "8G"
  time: "24:00:00"

Using SLURM profiles for HPC

For cluster environments, use a Snakemake SLURM profile:

# Create a SLURM profile
mkdir -p ~/.config/snakemake/slurm

# Run with the profile
alleleflux run --config config.yml -- --profile slurm

Or use the cluster-generic executor plugin:

alleleflux run --config config.yml -- \
    --executor cluster-generic \
    --cluster-generic-submit-cmd "sbatch --mem={resources.mem_per_job} --cpus-per-task={threads} --time={resources.time}"

Common Warnings

“NaN p-values” in statistical test output

Cause: Insufficient data at some genomic positions to compute a valid test statistic. This is normal and expected for sparse metagenomic data.

When to worry:

  • If a small fraction of positions have NaN p-values, this is typical – sparse coverage or zero-variance positions

  • If most positions have NaN p-values, check that samples have adequate coverage and that QC is not too aggressive

Action: NaN p-values are automatically excluded from score calculations. No intervention is needed unless the proportion is unexpectedly high.

“0 eligible MAGs” for a test type

Cause: No MAGs pass the eligibility criteria for a particular test. Common reasons:

  • Too few samples passing QC per group (below min_sample_num)

  • Paired tests require matching samples across timepoints – missing timepoints disqualify MAGs

  • Low genome coverage in most samples

Action:

# Relax QC thresholds
quality_control:
  min_sample_num: 3           # Lower minimum
  breadth_threshold: 0.05     # Lower coverage requirement

Check which MAGs are close to eligibility:

column -t -s $'\t' output/longitudinal/eligibility_table_*.tsv | head -20

“Reference base mismatch” in dN/dS or profile files

Cause: The reference base in profile data does not match the FASTA file. This typically means different reference FASTA files were used for alignment (BAM generation) and for AlleleFlux analysis.

Action:

  • Ensure all BAM files were aligned to the same combined reference FASTA specified in input.fasta_path

  • If BAMs were aligned to different references, re-align to the combined reference

  • Verify with: samtools view -H sample.bam | grep "^@SQ" | head and compare contig names to your reference FASTA


See Also