Running the Workflow

AlleleFlux uses Snakemake to manage the workflow. This guide explains how to run the workflow and understand its components.

Workflow Overview

AlleleFlux is a unified Snakemake pipeline that performs:

  1. Profiling: Process BAM files to extract allele frequencies for each MAG

  2. Quality Control: Filter samples based on coverage breadth and depth

  3. Eligibility: Determine which MAGs qualify for each statistical test

  4. Analysis: Analyze allele frequencies across samples

  5. Statistical Testing: Run appropriate tests based on your experimental design

  6. Scoring: Calculate parallelism and divergence scores

  7. Outlier Detection: Identify genes with exceptionally high scores

  8. dN/dS Analysis: Calculate evolutionary rates for genes under selection

The pipeline automatically manages dependencies between these steps using Snakemake checkpoints.

Running with the AlleleFlux CLI

The recommended way to run AlleleFlux is using the alleleflux run command:

alleleflux run --config config.yml

This command handles Snakemake invocation and resource management automatically.

Common Options:

# Specify threads and memory for local execution
alleleflux run --config config.yml --threads 16 --memory 64G

# Use a cluster profile for HPC execution
alleleflux run --config config.yml --profile slurm_profile/

# Dry run to see what would be executed
alleleflux run --config config.yml --dry-run

# Unlock a stuck working directory after a crash
alleleflux run --config config.yml --unlock

# Pass additional Snakemake arguments
alleleflux run --config config.yml -- --forceall --reason

Running with Snakemake Directly

You can also run Snakemake directly for more control:

# Find the Snakefile location
alleleflux info

# Run Snakemake directly
snakemake \
    --snakefile $(python -c "from alleleflux.workflow import get_snakefile; print(get_snakefile())") \
    --configfile /path/to/config.yml \
    --cores 16

Running on HPC Clusters (SLURM)

AlleleFlux ships with a SLURM profile for cluster execution:

# Copy the bundled SLURM profile
cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./

# Edit slurm_profile/config.yaml to match your cluster
# Then run with the profile
alleleflux run --config config.yml --profile ./slurm_profile

The SLURM profile submits each rule as a separate job via sbatch, using the resource settings from your config file (resources.threads_per_job, resources.mem_per_job, resources.time).

For other schedulers (PBS, SGE, LSF), create a custom Snakemake profile following the Snakemake documentation.

Available Statistical Tests

AlleleFlux supports several statistical testing approaches, controlled by config flags:

Between-Group Tests

Compare allele frequencies between experimental groups (e.g., treatment vs. control):

  • Two-sample unpaired (t-test, Mann-Whitney U): For independent samples from different individuals

  • Two-sample paired (paired t-test, Wilcoxon signed-rank): For matched samples from the same individuals

  • LMM (Linear Mixed Models): For repeated measures with subject-level random effects

  • CMH (Cochran-Mantel-Haenszel): For stratified categorical analysis controlling for replicates

Within-Group Tests

Analyze changes within a single experimental group:

  • Single-sample (one-sample t-test): Tests if allele frequencies deviate from reference

  • LMM across time: Within-group temporal changes with random effects

  • CMH across time: Within-group categorical changes over time

Enable or disable tests in your config:

analysis:
  use_significance_tests: true  # Two-sample and single-sample tests
  use_lmm: true                 # Linear Mixed Models
  use_cmh: true                 # Cochran-Mantel-Haenszel tests

See Statistical Tests Reference for detailed methodology and score calculation formulas.

Customizing the Workflow

The AlleleFlux config file controls all aspects of the pipeline. Below is an example configuration matching the structure of config.template.yml:

run_name: "my_analysis"

analysis:
  data_type: "longitudinal"
  use_significance_tests: true
  use_lmm: true
  use_cmh: true
  timepoints_combinations:
    - timepoint: ["pre", "post"]
      focus: "post"
  groups_combinations:
    - ["treatment", "control"]

input:
  fasta_path: "/path/to/reference.fa"
  prodigal_path: "/path/to/genes.fna"
  metadata_path: "/path/to/metadata.tsv"
  gtdb_path: "/path/to/gtdbtk.tsv"
  mag_mapping_path: "/path/to/mag_mapping.tsv"

output:
  root_dir: "/path/to/output"

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

statistics:
  filter_type: "t-test"
  p_value_threshold: 0.05

Generate a full config template with all options using:

alleleflux init --output config.yml

Data Type

The data_type setting is the most important configuration choice:

# For studies with multiple timepoints per subject
analysis:
  data_type: "longitudinal"
  timepoints_combinations:
    - timepoint: ["pre", "post"]
      focus: "post"
  groups_combinations:
    - ["treatment", "control"]

# For cross-sectional studies (one timepoint)
analysis:
  data_type: "single"
  timepoints_combinations:
    - timepoint: ["baseline"]
  groups_combinations:
    - ["disease", "healthy"]

Quality Control Thresholds

Adjust QC stringency based on your data quality:

quality_control:
  breadth_threshold: 0.1        # Minimum fraction of genome covered (0-1)
  coverage_threshold: 1.0       # Minimum average read depth
  min_sample_num: 4             # Minimum samples per group for statistical tests
  disable_zero_diff_filtering: false  # Keep constant positions if true

Tip

  • Start with breadth_threshold: 0.1 (10% coverage); increase for high-coverage data

  • Use min_sample_num: 4 as a minimum; more replicates improve statistical power

  • Set disable_zero_diff_filtering: false to focus on variable positions (recommended)

Profiling Parameters

Control read filtering during allele extraction:

profiling:
  min_base_quality: 30          # Phred score cutoff (30 = 99.9% accuracy)
  min_mapping_quality: 2        # MAPQ score cutoff
  ignore_orphans: false         # Include unpaired reads
  ignore_overlaps: true         # Avoid double-counting overlapping read pairs

Statistical Parameters

Fine-tune the preprocessing and testing:

statistics:
  filter_type: "t-test"          # "t-test", "wilcoxon", or "both"
  preprocess_between_groups: true # Filter positions before between-group tests
  preprocess_within_groups: true  # Filter positions before within-group tests
  max_zero_count: 4              # Max zero-frequency samples per position
  p_value_threshold: 0.05        # Significance threshold
  fdr_group_by_mag_id: false     # Apply FDR per-MAG (true) or genome-wide (false)
  min_positions_after_preprocess: 1  # Min positions to proceed with analysis

For a complete configuration reference, see Configuration Reference.

Resource Management

AlleleFlux provides flexible resource control for both local workstations and HPC clusters.

Resource Configuration

Set computational resources per job in your config file under the resources section:

resources:
  threads_per_job: 16           # Threads allocated per job
  mem_per_job: "8G"             # Memory per job (formats: "8G", "100G", "102400M")
  time: "24:00:00"              # Wall time limit (HH:MM:SS)

These values are used differently depending on your execution mode:

  • Local execution: Snakemake uses threads_per_job to determine how many jobs can run in parallel given the --threads limit. For example, with --threads 16 and threads_per_job: 4, up to 4 jobs run concurrently. Keep mem_per_job conservative to avoid out-of-memory errors.

  • Cluster execution (SLURM): These values are passed directly to sbatch as --cpus-per-task, --mem, and --time. SLURM allocates resources per node, so you can set higher values.

Per-Rule Resource Overrides

Power users can override resources for specific rules using the resources_override section:

resources_override:
  profile:
    threads_per_job: 8
    mem_per_job: "16G"
    time: "08:00:00"
  allele_analysis:
    mem_per_job: "32G"
  statistical_tests:
    threads_per_job: 4
    mem_per_job: "64G"
    time: "12:00:00"
  dnds_from_timepoints:
    threads_per_job: 1
    mem_per_job: "200G"
    time: "48:00:00"

SLURM Profile Setup

The bundled SLURM profile uses the cluster-generic executor plugin. To set it up:

  1. Copy the profile to your working directory:

    cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./
    
  2. Edit slurm_profile/config.yaml if needed. Key settings include:

    executor: cluster-generic
    jobs: 100                    # Max concurrent SLURM jobs
    latency-wait: 60            # Seconds to wait for NFS file propagation
    restart-times: 0            # Number of automatic retries on failure
    keep-going: True            # Continue with independent jobs if one fails
    rerun-incomplete: True      # Rerun jobs that were incomplete
    
  3. Run with the profile:

    alleleflux run --config config.yml --profile ./slurm_profile
    

SLURM job logs are written to logs/{run_name}/{date}/{rule}/ in your working directory.

Monitoring Progress

Checking Snakemake Output

While the pipeline runs, Snakemake prints progress to the terminal showing which rules are executing, completed, or pending. Key information includes:

  • Rule names and wildcards: Shows which MAG and timepoint/group combination is being processed

  • Job counts: Displays total jobs remaining and completed

  • Timestamps: Each log line is timestamped for tracking duration

Verifying the DAG Before Running

Use --dry-run to inspect the execution plan without running any jobs:

# See what rules would be executed
alleleflux run --config config.yml --dry-run

# With Snakemake directly, add --reason to see why each rule will run
snakemake --snakefile <snakefile_path> --configfile config.yml -n --reason

This is useful for verifying that your config changes produce the expected set of jobs before committing to a full run.

Checking Completed Rules

After a run (or during a partial run), you can check which output files exist:

# Check Snakemake's execution summary
snakemake --snakefile <snakefile_path> --configfile config.yml --summary

# List Snakemake log files for the run
ls output/longitudinal/.snakemake/log/

# View the most recent log
less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1)

Restarting Failed Runs

Automatic Resume

Snakemake tracks completed steps via output file timestamps. If a run is interrupted (e.g., by a crash, Ctrl+C, or a cluster timeout), simply re-run the same command:

alleleflux run --config config.yml --threads 16

Snakemake will skip all rules whose output files already exist and are newer than their inputs, resuming from where it left off.

Unlocking Stuck Directories

If the pipeline was killed abruptly (e.g., kill -9, node failure), Snakemake may leave a lock file that prevents re-execution. You will see an error like:

Error: Directory cannot be locked. ...

Unlock the directory with:

alleleflux run --config config.yml --unlock

Then re-run the pipeline as normal.

Force-Rerunning Specific Rules

To rerun a specific rule and everything downstream of it:

# Rerun a specific rule
alleleflux run --config config.yml -- --forcerun allele_analysis

# Rerun everything from scratch
alleleflux run --config config.yml -- --forceall

# Rerun and see the reason each job is triggered
alleleflux run --config config.yml -- --forceall --reason

Note

--forcerun marks the specified rule’s outputs as outdated, causing Snakemake to regenerate them and all dependent downstream outputs. Use this when you have changed parameters that affect a specific step but the input files have not changed.

Output Structure

The workflow generates organized output directories:

{root_dir}/
└── {data_type}/                    # "single" or "longitudinal"
    ├── profiles/                   # Per-sample allele profiles
    ├── inputMetadata/              # Per-MAG sample metadata
    ├── QC/                         # Quality control metrics
    ├── eligibility_table_*.tsv     # MAG eligibility per test
    ├── allele_analysis/            # Allele frequency analysis
    ├── significance_tests/         # Statistical test results
    │   ├── two_sample_unpaired_*/
    │   ├── two_sample_paired_*/
    │   ├── single_sample_*/
    │   ├── lmm_*/
    │   └── cmh_*/
    ├── scores/                     # Parallelism & divergence scores
    │   ├── intermediate/           #   Per-MAG raw scores
    │   └── processed/
    │       ├── combined/           #   MAG-level & taxonomic summaries
    │       └── gene_scores_*/      #   Gene-level scores
    ├── outlier_genes/              # Outlier gene detection
    └── dnds_analysis/              # dN/dS results

See Output Files Reference for file format details.

Troubleshooting

Locked Working Directory

If the pipeline was interrupted, the working directory may be locked. You will see:

Error: Directory cannot be locked. Please make sure that no other Snakemake process
is trying to create the same files in the following directory: ...

Solution:

alleleflux run --config config.yml --unlock

rpy2 or R Errors

R is required for CMH tests. Common errors include:

ModuleNotFoundError: No module named 'rpy2' – Install rpy2 in your environment:

pip install rpy2

RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr' – Install the R tidyr package:

R -e "install.packages('tidyr', repos='https://cloud.r-project.org')"

R_HOME not set – Ensure R is on your PATH and R_HOME is set:

R --version
# If using conda, the environment.yml includes R dependencies automatically

Memory Errors (OOM)

Symptoms: Jobs killed with Killed, MemoryError, or SLURM OUT_OF_MEMORY status.

Solution: Increase per-job memory in the config:

resources:
  mem_per_job: "32G"

Or use per-rule overrides for memory-intensive steps:

resources_override:
  allele_analysis:
    mem_per_job: "64G"
  dnds_from_timepoints:
    mem_per_job: "200G"

For local execution, also limit parallelism to reduce total memory usage:

alleleflux run --config config.yml --threads 4 --memory 32G

No MAGs Eligible for Tests

Symptoms: Pipeline completes quickly with no statistical test output files, or log messages like No eligible MAGs found.

Check the eligibility table to see why MAGs were filtered:

column -t -s $'\t' output/longitudinal/eligibility_table_pre_post-treatment_control.tsv

Common causes and fixes:

Cause

Fix

breadth_threshold is too high for your coverage

Lower to 0.05 or 0.01

min_sample_num exceeds your number of replicates

Reduce to match your smallest group size

BAM paths in metadata are incorrect or files are missing

Verify paths with head metadata.tsv and check files exist

Contig names in BAM don’t match reference FASTA

Ensure BAM was aligned to the same FASTA specified in config

Missing or Empty Output Files

Symptoms: Expected output files are not created, or files exist but are empty (0 bytes).

# Check Snakemake logs for errors
ls output/longitudinal/.snakemake/log/
less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1)

# Check SLURM job logs (if using cluster profile)
ls logs/<run_name>/

Common causes:

  • A required input file was not generated by a previous step. Check for errors in earlier rules.

  • The MAG had no positions passing QC filters. Try lowering coverage_threshold or breadth_threshold.

  • A worker process crashed silently. Check for ERROR lines in the log files.

Prodigal Gene ID Mismatches

Symptoms: dN/dS analysis produces no results, or gene scores are empty.

Ensure Prodigal was run on the same combined FASTA used as the reference:

# Check that gene headers match contig names in the reference
grep ">" prodigal_genes.fna | head -5
grep ">" combined_mags.fasta | head -5
# Gene headers should start with contig IDs from the FASTA

Config Validation Errors

Symptoms: Pipeline fails immediately with a configuration error.

# Validate your config against the template
alleleflux init --output config_template.yml
diff config.yml config_template.yml

Ensure all required keys are present (input.fasta_path, input.metadata_path, input.mag_mapping_path) and that data_type matches your timepoints_combinations structure.

SLURM Job Failures

Symptoms: Jobs submitted but fail with SLURM errors.

# Check SLURM job status
sacct -j <job_id> --format=JobID,State,ExitCode,MaxRSS,Elapsed

# Common SLURM states:
# TIMEOUT     -> Increase resources.time
# OUT_OF_MEMORY -> Increase resources.mem_per_job
# FAILED      -> Check the job .out log file in logs/

If jobs fail due to TIMEOUT, increase the wall time:

resources:
  time: "48:00:00"

Getting Help

  • Check the FAQ for common questions

  • Open a GitHub issue with your error log

  • Include alleleflux info output and the full error traceback