Running the Workflow

AlleleFlux uses Snakemake to manage its workflow. This guide explains how to run the workflow and describes its components.

Workflow Overview

AlleleFlux is a unified Snakemake pipeline that performs:

  1. Profiling: Process BAM files to extract allele frequencies for each MAG

  2. Quality Control: Filter samples based on coverage breadth and depth

  3. Eligibility: Determine which MAGs qualify for each statistical test

  4. Analysis: Analyze allele frequencies across samples

  5. Statistical Testing: Run appropriate tests based on your experimental design

  6. Scoring: Calculate parallelism and divergence scores

  7. Outlier Detection: Identify genes with exceptionally high scores

The pipeline automatically manages dependencies between these steps using Snakemake checkpoints.

Running with the AlleleFlux CLI

The recommended way to run AlleleFlux is using the alleleflux run command:

alleleflux run --config config.yml

This command handles Snakemake invocation and resource management automatically.

Common Options:

# Specify threads and memory for local execution
alleleflux run --config config.yml --threads 16 --memory 64G

# Use a cluster profile for HPC execution
alleleflux run --config config.yml --profile slurm_profile/

# Dry run to see what would be executed
alleleflux run --config config.yml --dry-run

# Pass additional Snakemake arguments
alleleflux run --config config.yml -- --forceall --reason
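
When AlleleFlux is launched from a script (for example, to loop over several config files), the command line can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above; the helper name build_alleleflux_command is hypothetical and not part of AlleleFlux.

```python
import shlex


def build_alleleflux_command(config, threads=None, memory=None,
                             profile=None, dry_run=False, snakemake_args=()):
    """Assemble an `alleleflux run` argument list (hypothetical helper).

    Only the flags documented in this guide are used.
    """
    cmd = ["alleleflux", "run", "--config", config]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if memory is not None:
        cmd += ["--memory", memory]
    if profile is not None:
        cmd += ["--profile", profile]
    if dry_run:
        cmd.append("--dry-run")
    if snakemake_args:
        # Everything after "--" is passed through to Snakemake.
        cmd += ["--", *snakemake_args]
    return cmd


cmd = build_alleleflux_command("config.yml", threads=16, memory="64G", dry_run=True)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.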

Running with Snakemake Directly

You can also run Snakemake directly for more control:

cd /path/to/AlleleFlux/alleleflux/smk_workflow/alleleflux_pipeline
snakemake --configfile /path/to/config.yml --profile profile/ --cores 16

Available Statistical Tests

AlleleFlux supports several statistical testing approaches:

Two-Sample Tests

These tests compare allele frequencies between groups:

  • Unpaired: For comparing independent samples from different groups

  • Paired: For comparing matched samples (e.g., before and after treatment)
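
The difference between the two designs can be illustrated with a small sketch. It uses t statistics purely as an illustration; this guide does not state which statistical tests AlleleFlux applies internally, and the function names and allele frequencies below are made up for the example.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t_statistic(before, after):
    """t statistic computed on per-pair differences (matched samples)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def welch_t_statistic(group_a, group_b):
    """Welch's t statistic for two independent groups."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Illustrative allele frequencies at one site for four subjects.
before = [0.10, 0.20, 0.30, 0.40]
after = [0.15, 0.26, 0.34, 0.47]

print("paired  :", round(paired_t_statistic(before, after), 2))
print("unpaired:", round(welch_t_statistic(after, before), 2))
```

Every subject shifts by a similar small amount, so the paired statistic is large, while the unpaired statistic is small because the between-subject spread dominates. This is why matched samples should be analyzed with the paired design when the matching is real.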

Single-Sample Tests

These tests analyze changes within a single group over time.

Linear Mixed Models (LMM)

LMM tests account for complex experimental designs by modeling both fixed and random effects, which makes them particularly useful for longitudinal data.

Cochran-Mantel-Haenszel (CMH) Tests

The CMH test is a stratified analysis of count data that:

  • Tests for association between allele changes and conditions while controlling for confounding factors

  • Provides position-by-position assessment of allele frequency changes

  • Is especially powerful for detecting parallel evolutionary changes
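
To make the stratified analysis concrete, here is a minimal sketch of the textbook CMH statistic for a single position, written in plain Python. It is not AlleleFlux's implementation. It assumes each stratum (for example, one replicate) contributes a 2x2 table with conditions as rows and allele counts as columns, and it applies no continuity correction.

```python
from math import erfc, sqrt


def cmh_test(tables):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables.

    Each table is ((a, b), (c, d)). Returns (statistic, p_value), where the
    statistic is compared against a chi-square distribution with 1 degree of
    freedom. No continuity correction is applied.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # the variance term is undefined for such a stratum
        expected = (a + b) * (a + c) / n
        diff_sum += a - expected
        var_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if var_sum == 0:
        raise ValueError("no informative strata")
    statistic = diff_sum ** 2 / var_sum
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Two strata with the same direction of allele shift (illustrative counts).
tables = [((10, 20), (20, 10)), ((10, 20), (20, 10))]
statistic, p_value = cmh_test(tables)
print(f"CMH statistic = {statistic:.2f}, p = {p_value:.2g}")
```

Because the deviations from expectation are summed across strata before squaring, shifts in the same direction reinforce each other, which is what makes the test sensitive to parallel changes.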

To enable/disable specific tests, modify these settings in the config.yml file:

analysis:
  use_significance_tests: true  # Enable/disable two-sample and single-sample tests
  use_lmm: true                # Enable/disable Linear Mixed Models
  use_cmh: true                # Enable/disable Cochran-Mantel-Haenszel tests
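
As a quick sanity check before a long run, the flags can be read back from the parsed configuration. The sketch below assumes the config has already been loaded into a dictionary (for example with yaml.safe_load from PyYAML) and treats a missing flag as disabled; that default is an assumption made for the example, not documented AlleleFlux behavior.

```python
# Flag names come from the config.yml fragment above; labels are descriptive only.
TEST_FLAGS = {
    "use_significance_tests": "two-sample and single-sample tests",
    "use_lmm": "Linear Mixed Models",
    "use_cmh": "Cochran-Mantel-Haenszel tests",
}


def enabled_tests(config):
    """Return labels of the tests switched on in the `analysis` section."""
    analysis = config.get("analysis", {})
    # Assumption: a flag that is absent is treated as disabled.
    return [label for key, label in TEST_FLAGS.items() if analysis.get(key, False)]


config = {"analysis": {"use_significance_tests": True, "use_lmm": False, "use_cmh": True}}
for label in enabled_tests(config):
    print("enabled:", label)
```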

Customizing the Workflow

You can customize the workflow by editing the config.yml file (see Input Preparation for details). Key configuration options include:

# Data type: "single" for a single timepoint or "longitudinal" for multiple timepoints
data_type: "longitudinal"

# Input files
input:
  bam_dir: "/path/to/bam_files"  # For backward compatibility
  fasta_path: "/path/to/reference.fa"
  prodigal_path: "/path/to/genes.fna"
  metadata_path: "/path/to/metadata.tsv"  # Must include bam_path column

# Output directory
output:
  root_dir: "/path/to/output"

# Quality control settings
quality_control:
  min_coverage_breadth: 0.5
  disable_zero_diff_filtering: false
  min_sample_num: 4
  breadth_threshold: 0.1

# Analysis settings
analysis:
  use_significance_tests: true
  use_lmm: true
  use_cmh: true
  significance_threshold: 0.05
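
The quality control thresholds can be read as a two-stage filter. The sketch below shows one plausible interpretation: a sample passes if its coverage breadth is at least min_coverage_breadth, and a MAG is retained if at least min_sample_num samples pass. Both the interpretation and the function names are assumptions for illustration; the actual eligibility rules in AlleleFlux may also consider groups, depth, and other criteria.

```python
def passing_samples(breadth_by_sample, min_coverage_breadth=0.5):
    """Samples whose coverage breadth meets the threshold (assumed semantics)."""
    return sorted(
        sample for sample, breadth in breadth_by_sample.items()
        if breadth >= min_coverage_breadth
    )


def mag_is_eligible(breadth_by_sample, min_coverage_breadth=0.5, min_sample_num=4):
    """A MAG is kept if enough samples pass the breadth filter (assumed semantics)."""
    return len(passing_samples(breadth_by_sample, min_coverage_breadth)) >= min_sample_num


# Illustrative breadth values for one MAG across five samples.
breadth = {"s1": 0.9, "s2": 0.6, "s3": 0.5, "s4": 0.4, "s5": 0.7}
print(passing_samples(breadth))
print(mag_is_eligible(breadth))
```

Raising min_coverage_breadth makes each sample harder to pass, so a stricter breadth threshold usually needs to be balanced against min_sample_num to avoid excluding most MAGs.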

Advanced Usage

Running on a Compute Cluster

AlleleFlux can run on a compute cluster through Snakemake’s cluster support. To run on a cluster:

  1. Create a cluster profile for your system (see Snakemake documentation)

  2. Run the workflow using your cluster profile:

# Using the AlleleFlux CLI with a cluster profile
alleleflux run --config config.yml --profile your_cluster_profile/

# Or using Snakemake directly
cd /path/to/AlleleFlux/alleleflux/smk_workflow/alleleflux_pipeline
snakemake --configfile /path/to/config.yml --profile your_cluster_profile/

Example cluster profiles are provided in the profile/ directory. You can adapt these for your specific computing environment (e.g., SLURM, PBS, SGE).

Output Files and Directories

The workflow generates several output directories:

output/
├── profiles/               # Sample profiles
├── metadata/               # MAG metadata
├── eligibility/            # Eligibility tables
├── allele_analysis/        # Allele frequency analysis results
├── significance_tests/     # Statistical test results
│   ├── lmm/                # Linear Mixed Model results
│   ├── cmh/                # Cochran-Mantel-Haenszel test results
│   └── preprocessed_two_sample/  # Preprocessed data for two-sample tests
├── scores/                 # Parallelism and divergence scores
│   ├── per_MAG/            # Scores per MAG
│   └── processed/          # Processed scores (taxonomic and combined)
└── outliers/               # Outlier gene detection results
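
After a run, a short script can confirm which of the top-level directories listed above were produced. This is a minimal sketch; the helper name is hypothetical, and some directories may legitimately be absent depending on which tests are enabled in config.yml.

```python
from pathlib import Path

# Top-level directories taken from the tree above.
EXPECTED_DIRS = [
    "profiles", "metadata", "eligibility", "allele_analysis",
    "significance_tests", "scores", "outliers",
]


def missing_output_dirs(root_dir):
    """Return the expected top-level directories that do not exist under root_dir."""
    root = Path(root_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Replace with the output root_dir from your config.yml.
print(missing_output_dirs("/path/to/output"))
```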

Checkpoint Files

The workflow creates checkpoint files at various stages. To restart a failed run, rerun the same command: Snakemake skips the steps whose outputs already exist and resumes from the last successfully completed step.

Troubleshooting

If you encounter issues when running the workflow:

  1. Check the Snakemake log files in the logs/ directory

  2. Ensure that all input files are in the correct format

  3. Verify that you have sufficient resources (memory, CPU, disk space)

  4. Check that all dependencies are installed correctly

Common Issues

  • Error in rpy2 or R dependencies: Ensure that R is installed, along with the R packages required for the CMH tests (e.g., stats)

  • Memory errors: Increase the memory allocation in your Snakemake profile

  • Missing files: Check paths in your config.yml file
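
For the missing-files case, a small pre-flight check can report which paths from the input section do not exist before the workflow starts. This sketch assumes the config has already been parsed into a dictionary (for example with yaml.safe_load from PyYAML); the helper name is hypothetical and not part of AlleleFlux.

```python
from pathlib import Path


def missing_input_paths(config):
    """Return {key: path} for entries of the `input` section that do not exist."""
    inputs = config.get("input", {})
    return {
        key: value
        for key, value in inputs.items()
        if isinstance(value, str) and not Path(value).exists()
    }


# Placeholder paths as used in the example config.yml above.
config = {
    "input": {
        "fasta_path": "/path/to/reference.fa",
        "prodigal_path": "/path/to/genes.fna",
        "metadata_path": "/path/to/metadata.tsv",
    }
}
for key, path in missing_input_paths(config).items():
    print(f"missing {key}: {path}")
```

The check covers only the paths listed in config.yml; BAM files referenced in the bam_path column of the metadata table would need a separate pass over that file.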