Running the Workflow¶
AlleleFlux uses Snakemake to manage the workflow. This guide explains how to run the workflow and understand its components.
Workflow Overview¶
AlleleFlux is a unified Snakemake pipeline that performs:
Profiling: Process BAM files to extract allele frequencies for each MAG
Quality Control: Filter samples based on coverage breadth and depth
Eligibility: Determine which MAGs qualify for each statistical test
Analysis: Analyze allele frequencies across samples
Statistical Testing: Run appropriate tests based on your experimental design
Scoring: Calculate parallelism and divergence scores
Outlier Detection: Identify genes with exceptionally high scores
dN/dS Analysis: Calculate evolutionary rates for genes under selection
The pipeline automatically manages dependencies between these steps using Snakemake checkpoints.
Running with the AlleleFlux CLI¶
The recommended way to run AlleleFlux is using the alleleflux run command:
alleleflux run --config config.yml
This command handles Snakemake invocation and resource management automatically.
Common Options:
# Specify threads and memory for local execution
alleleflux run --config config.yml --threads 16 --memory 64G
# Use a cluster profile for HPC execution
alleleflux run --config config.yml --profile slurm_profile/
# Dry run to see what would be executed
alleleflux run --config config.yml --dry-run
# Unlock a stuck working directory after a crash
alleleflux run --config config.yml --unlock
# Pass additional Snakemake arguments
alleleflux run --config config.yml -- --forceall --reason
Running with Snakemake Directly¶
You can also run Snakemake directly for more control:
# Find the Snakefile location
alleleflux info
# Run Snakemake directly
snakemake \
--snakefile $(python -c "from alleleflux.workflow import get_snakefile; print(get_snakefile())") \
--configfile /path/to/config.yml \
--cores 16
Running on HPC Clusters (SLURM)¶
AlleleFlux ships with a SLURM profile for cluster execution:
# Copy the bundled SLURM profile
cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./
# Edit slurm_profile/config.yaml to match your cluster
# Then run with the profile
alleleflux run --config config.yml --profile ./slurm_profile
The SLURM profile submits each rule as a separate job via sbatch, using the resource settings from your config file (resources.threads_per_job, resources.mem_per_job, resources.time).
For other schedulers (PBS, SGE, LSF), create a custom Snakemake profile following the Snakemake documentation.
Available Statistical Tests¶
AlleleFlux supports several statistical testing approaches, controlled by config flags:
Between-Group Tests¶
Compare allele frequencies between experimental groups (e.g., treatment vs. control):
Two-sample unpaired (t-test, Mann-Whitney U): For independent samples from different individuals
Two-sample paired (paired t-test, Wilcoxon signed-rank): For matched samples from the same individuals
LMM (Linear Mixed Models): For repeated measures with subject-level random effects
CMH (Cochran-Mantel-Haenszel): For stratified categorical analysis controlling for replicates
Within-Group Tests¶
Analyze changes within a single experimental group:
Single-sample (one-sample t-test): Tests if allele frequencies deviate from reference
LMM across time: Within-group temporal changes with random effects
CMH across time: Within-group categorical changes over time
Enable or disable tests in your config:
analysis:
use_significance_tests: true # Two-sample and single-sample tests
use_lmm: true # Linear Mixed Models
use_cmh: true # Cochran-Mantel-Haenszel tests
See Statistical Tests Reference for detailed methodology and score calculation formulas.
Customizing the Workflow¶
The AlleleFlux config file controls all aspects of the pipeline. Below is an example configuration matching the structure of config.template.yml:
run_name: "my_analysis"
analysis:
data_type: "longitudinal"
use_significance_tests: true
use_lmm: true
use_cmh: true
timepoints_combinations:
- timepoint: ["pre", "post"]
focus: "post"
groups_combinations:
- ["treatment", "control"]
input:
fasta_path: "/path/to/reference.fa"
prodigal_path: "/path/to/genes.fna"
metadata_path: "/path/to/metadata.tsv"
gtdb_path: "/path/to/gtdbtk.tsv"
mag_mapping_path: "/path/to/mag_mapping.tsv"
output:
root_dir: "/path/to/output"
quality_control:
min_sample_num: 4
breadth_threshold: 0.1
coverage_threshold: 1.0
disable_zero_diff_filtering: false
statistics:
filter_type: "t-test"
p_value_threshold: 0.05
Generate a full config template with all options using:
alleleflux init --output config.yml
Data Type¶
The data_type setting is the most important configuration choice:
# For studies with multiple timepoints per subject
analysis:
data_type: "longitudinal"
timepoints_combinations:
- timepoint: ["pre", "post"]
focus: "post"
groups_combinations:
- ["treatment", "control"]
# For cross-sectional studies (one timepoint)
analysis:
data_type: "single"
timepoints_combinations:
- timepoint: ["baseline"]
groups_combinations:
- ["disease", "healthy"]
Quality Control Thresholds¶
Adjust QC stringency based on your data quality:
quality_control:
breadth_threshold: 0.1 # Minimum fraction of genome covered (0-1)
coverage_threshold: 1.0 # Minimum average read depth
min_sample_num: 4 # Minimum samples per group for statistical tests
disable_zero_diff_filtering: false # Keep constant positions if true
Tip
Start with
breadth_threshold: 0.1(10% coverage); increase for high-coverage dataUse
min_sample_num: 4as a minimum; more replicates improve statistical powerSet
disable_zero_diff_filtering: falseto focus on variable positions (recommended)
Profiling Parameters¶
Control read filtering during allele extraction:
profiling:
min_base_quality: 30 # Phred score cutoff (30 = 99.9% accuracy)
min_mapping_quality: 2 # MAPQ score cutoff
ignore_orphans: false # Include unpaired reads
ignore_overlaps: true # Avoid double-counting overlapping read pairs
Statistical Parameters¶
Fine-tune the preprocessing and testing:
statistics:
filter_type: "t-test" # "t-test", "wilcoxon", or "both"
preprocess_between_groups: true # Filter positions before between-group tests
preprocess_within_groups: true # Filter positions before within-group tests
max_zero_count: 4 # Max zero-frequency samples per position
p_value_threshold: 0.05 # Significance threshold
fdr_group_by_mag_id: false # Apply FDR per-MAG (true) or genome-wide (false)
min_positions_after_preprocess: 1 # Min positions to proceed with analysis
For a complete configuration reference, see Configuration Reference.
Resource Management¶
AlleleFlux provides flexible resource control for both local workstations and HPC clusters.
Resource Configuration¶
Set computational resources per job in your config file under the resources section:
resources:
threads_per_job: 16 # Threads allocated per job
mem_per_job: "8G" # Memory per job (formats: "8G", "100G", "102400M")
time: "24:00:00" # Wall time limit (HH:MM:SS)
These values are used differently depending on your execution mode:
Local execution: Snakemake uses
threads_per_jobto determine how many jobs can run in parallel given the--threadslimit. For example, with--threads 16andthreads_per_job: 4, up to 4 jobs run concurrently. Keepmem_per_jobconservative to avoid out-of-memory errors.Cluster execution (SLURM): These values are passed directly to
sbatchas--cpus-per-task,--mem, and--time. SLURM allocates resources per node, so you can set higher values.
Per-Rule Resource Overrides¶
Power users can override resources for specific rules using the resources_override section:
resources_override:
profile:
threads_per_job: 8
mem_per_job: "16G"
time: "08:00:00"
allele_analysis:
mem_per_job: "32G"
statistical_tests:
threads_per_job: 4
mem_per_job: "64G"
time: "12:00:00"
dnds_from_timepoints:
threads_per_job: 1
mem_per_job: "200G"
time: "48:00:00"
SLURM Profile Setup¶
The bundled SLURM profile uses the cluster-generic executor plugin. To set it up:
Copy the profile to your working directory:
cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./
Edit
slurm_profile/config.yamlif needed. Key settings include:executor: cluster-generic jobs: 100 # Max concurrent SLURM jobs latency-wait: 60 # Seconds to wait for NFS file propagation restart-times: 0 # Number of automatic retries on failure keep-going: True # Continue with independent jobs if one fails rerun-incomplete: True # Rerun jobs that were incomplete
Run with the profile:
alleleflux run --config config.yml --profile ./slurm_profile
SLURM job logs are written to logs/{run_name}/{date}/{rule}/ in your working directory.
Monitoring Progress¶
Checking Snakemake Output¶
While the pipeline runs, Snakemake prints progress to the terminal showing which rules are executing, completed, or pending. Key information includes:
Rule names and wildcards: Shows which MAG and timepoint/group combination is being processed
Job counts: Displays total jobs remaining and completed
Timestamps: Each log line is timestamped for tracking duration
Verifying the DAG Before Running¶
Use --dry-run to inspect the execution plan without running any jobs:
# See what rules would be executed
alleleflux run --config config.yml --dry-run
# With Snakemake directly, add --reason to see why each rule will run
snakemake --snakefile <snakefile_path> --configfile config.yml -n --reason
This is useful for verifying that your config changes produce the expected set of jobs before committing to a full run.
Checking Completed Rules¶
After a run (or during a partial run), you can check which output files exist:
# Check Snakemake's execution summary
snakemake --snakefile <snakefile_path> --configfile config.yml --summary
# List Snakemake log files for the run
ls output/longitudinal/.snakemake/log/
# View the most recent log
less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1)
Restarting Failed Runs¶
Automatic Resume¶
Snakemake tracks completed steps via output file timestamps. If a run is interrupted (e.g., by a crash, Ctrl+C, or a cluster timeout), simply re-run the same command:
alleleflux run --config config.yml --threads 16
Snakemake will skip all rules whose output files already exist and are newer than their inputs, resuming from where it left off.
Unlocking Stuck Directories¶
If the pipeline was killed abruptly (e.g., kill -9, node failure), Snakemake may leave a lock file that prevents re-execution. You will see an error like:
Error: Directory cannot be locked. ...
Unlock the directory with:
alleleflux run --config config.yml --unlock
Then re-run the pipeline as normal.
Force-Rerunning Specific Rules¶
To rerun a specific rule and everything downstream of it:
# Rerun a specific rule
alleleflux run --config config.yml -- --forcerun allele_analysis
# Rerun everything from scratch
alleleflux run --config config.yml -- --forceall
# Rerun and see the reason each job is triggered
alleleflux run --config config.yml -- --forceall --reason
Note
--forcerun marks the specified rule’s outputs as outdated, causing Snakemake to regenerate them and all dependent downstream outputs. Use this when you have changed parameters that affect a specific step but the input files have not changed.
Output Structure¶
The workflow generates organized output directories:
{root_dir}/
└── {data_type}/ # "single" or "longitudinal"
├── profiles/ # Per-sample allele profiles
├── inputMetadata/ # Per-MAG sample metadata
├── QC/ # Quality control metrics
├── eligibility_table_*.tsv # MAG eligibility per test
├── allele_analysis/ # Allele frequency analysis
├── significance_tests/ # Statistical test results
│ ├── two_sample_unpaired_*/
│ ├── two_sample_paired_*/
│ ├── single_sample_*/
│ ├── lmm_*/
│ └── cmh_*/
├── scores/ # Parallelism & divergence scores
│ ├── intermediate/ # Per-MAG raw scores
│ └── processed/
│ ├── combined/ # MAG-level & taxonomic summaries
│ └── gene_scores_*/ # Gene-level scores
├── outlier_genes/ # Outlier gene detection
└── dnds_analysis/ # dN/dS results
See Output Files Reference for file format details.
Troubleshooting¶
Locked Working Directory¶
If the pipeline was interrupted, the working directory may be locked. You will see:
Error: Directory cannot be locked. Please make sure that no other Snakemake process
is trying to create the same files in the following directory: ...
Solution:
alleleflux run --config config.yml --unlock
rpy2 or R Errors¶
R is required for CMH tests. Common errors include:
ModuleNotFoundError: No module named 'rpy2' – Install rpy2 in your environment:
pip install rpy2
RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr' – Install the R tidyr package:
R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"
R_HOME not set – Ensure R is on your PATH and R_HOME is set:
R --version
# If using conda, the environment.yml includes R dependencies automatically
Memory Errors (OOM)¶
Symptoms: Jobs killed with Killed, MemoryError, or SLURM OUT_OF_MEMORY status.
Solution: Increase per-job memory in the config:
resources:
mem_per_job: "32G"
Or use per-rule overrides for memory-intensive steps:
resources_override:
allele_analysis:
mem_per_job: "64G"
dnds_from_timepoints:
mem_per_job: "200G"
For local execution, also limit parallelism to reduce total memory usage:
alleleflux run --config config.yml --threads 4 --memory 32G
No MAGs Eligible for Tests¶
Symptoms: Pipeline completes quickly with no statistical test output files, or log messages like No eligible MAGs found.
Check the eligibility table to see why MAGs were filtered:
column -t -s $'\t' output/longitudinal/eligibility_table_pre_post-treatment_control.tsv
Common causes and fixes:
Cause |
Fix |
|---|---|
|
Lower to |
|
Reduce to match your smallest group size |
BAM paths in metadata are incorrect or files are missing |
Verify paths with |
Contig names in BAM don’t match reference FASTA |
Ensure BAM was aligned to the same FASTA specified in config |
Missing or Empty Output Files¶
Symptoms: Expected output files are not created, or files exist but are empty (0 bytes).
# Check Snakemake logs for errors
ls output/longitudinal/.snakemake/log/
less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1)
# Check SLURM job logs (if using cluster profile)
ls logs/<run_name>/
Common causes:
A required input file was not generated by a previous step. Check for errors in earlier rules.
The MAG had no positions passing QC filters. Try lowering
coverage_thresholdorbreadth_threshold.A worker process crashed silently. Check for
ERRORlines in the log files.
Prodigal Gene ID Mismatches¶
Symptoms: dN/dS analysis produces no results, or gene scores are empty.
Ensure Prodigal was run on the same combined FASTA used as the reference:
# Check that gene headers match contig names in the reference
grep ">" prodigal_genes.fna | head -5
grep ">" combined_mags.fasta | head -5
# Gene headers should start with contig IDs from the FASTA
Config Validation Errors¶
Symptoms: Pipeline fails immediately with a configuration error.
# Validate your config against the template
alleleflux init --output config_template.yml
diff config.yml config_template.yml
Ensure all required keys are present (input.fasta_path, input.metadata_path, input.mag_mapping_path) and that data_type matches your timepoints_combinations structure.
SLURM Job Failures¶
Symptoms: Jobs submitted but fail with SLURM errors.
# Check SLURM job status
sacct -j <job_id> --format=JobID,State,ExitCode,MaxRSS,Elapsed
# Common SLURM states:
# TIMEOUT -> Increase resources.time
# OUT_OF_MEMORY -> Increase resources.mem_per_job
# FAILED -> Check the job .out log file in logs/
If jobs fail due to TIMEOUT, increase the wall time:
resources:
time: "48:00:00"
Getting Help¶
Check the FAQ for common questions
Open a GitHub issue with your error log
Include
alleleflux infooutput and the full error traceback