# Running the Workflow AlleleFlux uses Snakemake to manage the workflow. This guide explains how to run the workflow and understand its components. ## Workflow Overview AlleleFlux is a unified Snakemake pipeline that performs: 1. **Profiling**: Process BAM files to extract allele frequencies for each MAG 2. **Quality Control**: Filter samples based on coverage breadth and depth 3. **Eligibility**: Determine which MAGs qualify for each statistical test 4. **Analysis**: Analyze allele frequencies across samples 5. **Statistical Testing**: Run appropriate tests based on your experimental design 6. **Scoring**: Calculate parallelism and divergence scores 7. **Outlier Detection**: Identify genes with exceptionally high scores 8. **dN/dS Analysis**: Calculate evolutionary rates for genes under selection The pipeline automatically manages dependencies between these steps using Snakemake checkpoints. ## Running with the AlleleFlux CLI The recommended way to run AlleleFlux is using the `alleleflux run` command: ```bash alleleflux run --config config.yml ``` This command handles Snakemake invocation and resource management automatically. **Common Options:** ```bash # Specify threads and memory for local execution alleleflux run --config config.yml --threads 16 --memory 64G # Use a cluster profile for HPC execution alleleflux run --config config.yml --profile slurm_profile/ # Dry run to see what would be executed alleleflux run --config config.yml --dry-run # Unlock a stuck working directory after a crash alleleflux run --config config.yml --unlock # Pass additional Snakemake arguments alleleflux run --config config.yml -- --forceall --reason ``` ## Running with Snakemake Directly You can also run Snakemake directly for more control: ```bash # Find the Snakefile location alleleflux info # Run Snakemake directly snakemake \ --snakefile $(python -c "from alleleflux.workflow import get_snakefile; print(get_snakefile())") \ --configfile /path/to/config.yml \ --cores 16 ``` ## Running on HPC Clusters (SLURM) AlleleFlux ships with a SLURM profile for cluster execution: ```bash # Copy the bundled SLURM profile cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./ # Edit slurm_profile/config.yaml to match your cluster # Then run with the profile alleleflux run --config config.yml --profile ./slurm_profile ``` The SLURM profile submits each rule as a separate job via `sbatch`, using the resource settings from your config file (`resources.threads_per_job`, `resources.mem_per_job`, `resources.time`). For other schedulers (PBS, SGE, LSF), create a custom Snakemake profile following the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles). ## Available Statistical Tests AlleleFlux supports several statistical testing approaches, controlled by config flags: ### Between-Group Tests Compare allele frequencies between experimental groups (e.g., treatment vs. control): - **Two-sample unpaired** (t-test, Mann-Whitney U): For independent samples from different individuals - **Two-sample paired** (paired t-test, Wilcoxon signed-rank): For matched samples from the same individuals - **LMM** (Linear Mixed Models): For repeated measures with subject-level random effects - **CMH** (Cochran-Mantel-Haenszel): For stratified categorical analysis controlling for replicates ### Within-Group Tests Analyze changes within a single experimental group: - **Single-sample** (one-sample t-test): Tests if allele frequencies deviate from reference - **LMM across time**: Within-group temporal changes with random effects - **CMH across time**: Within-group categorical changes over time Enable or disable tests in your config: ```yaml analysis: use_significance_tests: true # Two-sample and single-sample tests use_lmm: true # Linear Mixed Models use_cmh: true # Cochran-Mantel-Haenszel tests ``` See [Statistical Tests Reference](../reference/statistical_tests.md) for detailed methodology and score calculation formulas. ## Customizing the Workflow The AlleleFlux config file controls all aspects of the pipeline. Below is an example configuration matching the structure of `config.template.yml`: ```yaml run_name: "my_analysis" analysis: data_type: "longitudinal" use_significance_tests: true use_lmm: true use_cmh: true timepoints_combinations: - timepoint: ["pre", "post"] focus: "post" groups_combinations: - ["treatment", "control"] input: fasta_path: "/path/to/reference.fa" prodigal_path: "/path/to/genes.fna" metadata_path: "/path/to/metadata.tsv" gtdb_path: "/path/to/gtdbtk.tsv" mag_mapping_path: "/path/to/mag_mapping.tsv" output: root_dir: "/path/to/output" quality_control: min_sample_num: 4 breadth_threshold: 0.1 coverage_threshold: 1.0 disable_zero_diff_filtering: false statistics: filter_type: "t-test" p_value_threshold: 0.05 ``` Generate a full config template with all options using: ```bash alleleflux init --output config.yml ``` ### Data Type The `data_type` setting is the most important configuration choice: ```yaml # For studies with multiple timepoints per subject analysis: data_type: "longitudinal" timepoints_combinations: - timepoint: ["pre", "post"] focus: "post" groups_combinations: - ["treatment", "control"] # For cross-sectional studies (one timepoint) analysis: data_type: "single" timepoints_combinations: - timepoint: ["baseline"] groups_combinations: - ["disease", "healthy"] ``` ### Quality Control Thresholds Adjust QC stringency based on your data quality: ```yaml quality_control: breadth_threshold: 0.1 # Minimum fraction of genome covered (0-1) coverage_threshold: 1.0 # Minimum average read depth min_sample_num: 4 # Minimum samples per group for statistical tests disable_zero_diff_filtering: false # Keep constant positions if true ``` :::{tip} - Start with `breadth_threshold: 0.1` (10% coverage); increase for high-coverage data - Use `min_sample_num: 4` as a minimum; more replicates improve statistical power - Set `disable_zero_diff_filtering: false` to focus on variable positions (recommended) ::: ### Profiling Parameters Control read filtering during allele extraction: ```yaml profiling: min_base_quality: 30 # Phred score cutoff (30 = 99.9% accuracy) min_mapping_quality: 2 # MAPQ score cutoff ignore_orphans: false # Include unpaired reads ignore_overlaps: true # Avoid double-counting overlapping read pairs ``` ### Statistical Parameters Fine-tune the preprocessing and testing: ```yaml statistics: filter_type: "t-test" # "t-test", "wilcoxon", or "both" preprocess_between_groups: true # Filter positions before between-group tests preprocess_within_groups: true # Filter positions before within-group tests max_zero_count: 4 # Max zero-frequency samples per position p_value_threshold: 0.05 # Significance threshold fdr_group_by_mag_id: false # Apply FDR per-MAG (true) or genome-wide (false) min_positions_after_preprocess: 1 # Min positions to proceed with analysis ``` For a complete configuration reference, see [Configuration Reference](../reference/configuration.md). ## Resource Management AlleleFlux provides flexible resource control for both local workstations and HPC clusters. ### Resource Configuration Set computational resources per job in your config file under the `resources` section: ```yaml resources: threads_per_job: 16 # Threads allocated per job mem_per_job: "8G" # Memory per job (formats: "8G", "100G", "102400M") time: "24:00:00" # Wall time limit (HH:MM:SS) ``` These values are used differently depending on your execution mode: - **Local execution**: Snakemake uses `threads_per_job` to determine how many jobs can run in parallel given the `--threads` limit. For example, with `--threads 16` and `threads_per_job: 4`, up to 4 jobs run concurrently. Keep `mem_per_job` conservative to avoid out-of-memory errors. - **Cluster execution (SLURM)**: These values are passed directly to `sbatch` as `--cpus-per-task`, `--mem`, and `--time`. SLURM allocates resources per node, so you can set higher values. ### Per-Rule Resource Overrides Power users can override resources for specific rules using the `resources_override` section: ```yaml resources_override: profile: threads_per_job: 8 mem_per_job: "16G" time: "08:00:00" allele_analysis: mem_per_job: "32G" statistical_tests: threads_per_job: 4 mem_per_job: "64G" time: "12:00:00" dnds_from_timepoints: threads_per_job: 1 mem_per_job: "200G" time: "48:00:00" ``` ### SLURM Profile Setup The bundled SLURM profile uses the `cluster-generic` executor plugin. To set it up: 1. Copy the profile to your working directory: ```bash cp -r $(python -c "import alleleflux; print(alleleflux.__path__[0])")/smk_workflow/slurm_profile ./ ``` 2. Edit `slurm_profile/config.yaml` if needed. Key settings include: ```yaml executor: cluster-generic jobs: 100 # Max concurrent SLURM jobs latency-wait: 60 # Seconds to wait for NFS file propagation restart-times: 0 # Number of automatic retries on failure keep-going: True # Continue with independent jobs if one fails rerun-incomplete: True # Rerun jobs that were incomplete ``` 3. Run with the profile: ```bash alleleflux run --config config.yml --profile ./slurm_profile ``` SLURM job logs are written to `logs/{run_name}/{date}/{rule}/` in your working directory. ## Monitoring Progress ### Checking Snakemake Output While the pipeline runs, Snakemake prints progress to the terminal showing which rules are executing, completed, or pending. Key information includes: - **Rule names and wildcards**: Shows which MAG and timepoint/group combination is being processed - **Job counts**: Displays total jobs remaining and completed - **Timestamps**: Each log line is timestamped for tracking duration ### Verifying the DAG Before Running Use `--dry-run` to inspect the execution plan without running any jobs: ```bash # See what rules would be executed alleleflux run --config config.yml --dry-run # With Snakemake directly, add --reason to see why each rule will run snakemake --snakefile --configfile config.yml -n --reason ``` This is useful for verifying that your config changes produce the expected set of jobs before committing to a full run. ### Checking Completed Rules After a run (or during a partial run), you can check which output files exist: ```bash # Check Snakemake's execution summary snakemake --snakefile --configfile config.yml --summary # List Snakemake log files for the run ls output/longitudinal/.snakemake/log/ # View the most recent log less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1) ``` ## Restarting Failed Runs ### Automatic Resume Snakemake tracks completed steps via output file timestamps. If a run is interrupted (e.g., by a crash, `Ctrl+C`, or a cluster timeout), simply re-run the same command: ```bash alleleflux run --config config.yml --threads 16 ``` Snakemake will skip all rules whose output files already exist and are newer than their inputs, resuming from where it left off. ### Unlocking Stuck Directories If the pipeline was killed abruptly (e.g., `kill -9`, node failure), Snakemake may leave a lock file that prevents re-execution. You will see an error like: ``` Error: Directory cannot be locked. ... ``` Unlock the directory with: ```bash alleleflux run --config config.yml --unlock ``` Then re-run the pipeline as normal. ### Force-Rerunning Specific Rules To rerun a specific rule and everything downstream of it: ```bash # Rerun a specific rule alleleflux run --config config.yml -- --forcerun allele_analysis # Rerun everything from scratch alleleflux run --config config.yml -- --forceall # Rerun and see the reason each job is triggered alleleflux run --config config.yml -- --forceall --reason ``` :::{note} `--forcerun` marks the specified rule's outputs as outdated, causing Snakemake to regenerate them and all dependent downstream outputs. Use this when you have changed parameters that affect a specific step but the input files have not changed. ::: ## Output Structure The workflow generates organized output directories: ```text {root_dir}/ └── {data_type}/ # "single" or "longitudinal" ├── profiles/ # Per-sample allele profiles ├── inputMetadata/ # Per-MAG sample metadata ├── QC/ # Quality control metrics ├── eligibility_table_*.tsv # MAG eligibility per test ├── allele_analysis/ # Allele frequency analysis ├── significance_tests/ # Statistical test results │ ├── two_sample_unpaired_*/ │ ├── two_sample_paired_*/ │ ├── single_sample_*/ │ ├── lmm_*/ │ └── cmh_*/ ├── scores/ # Parallelism & divergence scores │ ├── intermediate/ # Per-MAG raw scores │ └── processed/ │ ├── combined/ # MAG-level & taxonomic summaries │ └── gene_scores_*/ # Gene-level scores ├── outlier_genes/ # Outlier gene detection └── dnds_analysis/ # dN/dS results ``` See [Output Files Reference](../reference/outputs.md) for file format details. ## Troubleshooting ### Locked Working Directory If the pipeline was interrupted, the working directory may be locked. You will see: ``` Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory: ... ``` **Solution:** ```bash alleleflux run --config config.yml --unlock ``` ### rpy2 or R Errors R is required for CMH tests. Common errors include: **`ModuleNotFoundError: No module named 'rpy2'`** -- Install rpy2 in your environment: ```bash pip install rpy2 ``` **`RRuntimeError: Error in library(tidyr) : there is no package called 'tidyr'`** -- Install the R `tidyr` package: ```bash R -e "install.packages('tidyr', repos='https://cloud.r-project.org')" ``` **`R_HOME` not set** -- Ensure R is on your `PATH` and `R_HOME` is set: ```bash R --version # If using conda, the environment.yml includes R dependencies automatically ``` ### Memory Errors (OOM) **Symptoms**: Jobs killed with `Killed`, `MemoryError`, or SLURM `OUT_OF_MEMORY` status. **Solution**: Increase per-job memory in the config: ```yaml resources: mem_per_job: "32G" ``` Or use per-rule overrides for memory-intensive steps: ```yaml resources_override: allele_analysis: mem_per_job: "64G" dnds_from_timepoints: mem_per_job: "200G" ``` For local execution, also limit parallelism to reduce total memory usage: ```bash alleleflux run --config config.yml --threads 4 --memory 32G ``` ### No MAGs Eligible for Tests **Symptoms**: Pipeline completes quickly with no statistical test output files, or log messages like `No eligible MAGs found`. Check the eligibility table to see why MAGs were filtered: ```bash column -t -s $'\t' output/longitudinal/eligibility_table_pre_post-treatment_control.tsv ``` **Common causes and fixes:** | Cause | Fix | |-------|-----| | `breadth_threshold` is too high for your coverage | Lower to `0.05` or `0.01` | | `min_sample_num` exceeds your number of replicates | Reduce to match your smallest group size | | BAM paths in metadata are incorrect or files are missing | Verify paths with `head metadata.tsv` and check files exist | | Contig names in BAM don't match reference FASTA | Ensure BAM was aligned to the same FASTA specified in config | ### Missing or Empty Output Files **Symptoms**: Expected output files are not created, or files exist but are empty (0 bytes). ```bash # Check Snakemake logs for errors ls output/longitudinal/.snakemake/log/ less output/longitudinal/.snakemake/log/$(ls -t output/longitudinal/.snakemake/log/ | head -1) # Check SLURM job logs (if using cluster profile) ls logs// ``` **Common causes:** - A required input file was not generated by a previous step. Check for errors in earlier rules. - The MAG had no positions passing QC filters. Try lowering `coverage_threshold` or `breadth_threshold`. - A worker process crashed silently. Check for `ERROR` lines in the log files. ### Prodigal Gene ID Mismatches **Symptoms**: dN/dS analysis produces no results, or gene scores are empty. Ensure Prodigal was run on the same combined FASTA used as the reference: ```bash # Check that gene headers match contig names in the reference grep ">" prodigal_genes.fna | head -5 grep ">" combined_mags.fasta | head -5 # Gene headers should start with contig IDs from the FASTA ``` ### Config Validation Errors **Symptoms**: Pipeline fails immediately with a configuration error. ```bash # Validate your config against the template alleleflux init --output config_template.yml diff config.yml config_template.yml ``` Ensure all required keys are present (`input.fasta_path`, `input.metadata_path`, `input.mag_mapping_path`) and that `data_type` matches your `timepoints_combinations` structure. ### SLURM Job Failures **Symptoms**: Jobs submitted but fail with SLURM errors. ```bash # Check SLURM job status sacct -j --format=JobID,State,ExitCode,MaxRSS,Elapsed # Common SLURM states: # TIMEOUT -> Increase resources.time # OUT_OF_MEMORY -> Increase resources.mem_per_job # FAILED -> Check the job .out log file in logs/ ``` If jobs fail due to `TIMEOUT`, increase the wall time: ```yaml resources: time: "48:00:00" ``` ### Getting Help - Check the [FAQ](../reference/faq.md) for common questions - Open a [GitHub issue](https://github.com/MoellerLabPU/AlleleFlux/issues) with your error log - Include `alleleflux info` output and the full error traceback