Configuration Reference
Complete reference for AlleleFlux configuration options.
Quick Start
Copy the template configuration:
cp alleleflux/smk_workflow/config.template.yml my_config.yml
Edit my_config.yml with your file paths and analysis parameters.
Core Parameters
run_name (optional)
Unique identifier for this analysis.
run_name: my_study_2024
input
Paths to required input files.
| Parameter | Description |
|---|---|
| fasta_path | Path to the combined reference FASTA file containing all MAG contigs. Header format should be |
| prodigal_path | Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA. |
| metadata_path | Path to sample metadata TSV file. Must contain columns: |
| gtdb_path | Path to GTDB-Tk taxonomy file ( |
| mag_mapping_path | Path to contig-to-MAG mapping file (TSV with |
Example:
input:
  fasta_path: /path/to/combined_mags.fasta
  prodigal_path: /path/to/prodigal_genes.fna
  metadata_path: /path/to/sample_metadata.tsv
  gtdb_path: /path/to/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /path/to/mag_mapping.tsv
output
Output directory configuration.
| Parameter | Description |
|---|---|
| root_dir | Root directory for all output files. Subdirectories will be created for each analysis step. |
Example:
output:
  root_dir: ./alleleflux_output
analysis
Core analysis settings.
| Parameter | Default | Description |
|---|---|---|
| data_type | | Type of analysis: |
| allele_analysis_only | | If true, only run allele frequency analysis without statistical tests. |
| use_lmm | | Enable Linear Mixed Models (LMM) for repeated measures/longitudinal data. Best for accounting for subject-level variation. |
| use_significance_tests | | Enable two-sample (t-test, Mann-Whitney) and single-sample statistical tests. Best for simple comparisons. |
| use_cmh | | Enable Cochran-Mantel-Haenszel tests for stratified categorical analysis. Best for detecting consistent directional changes. |
| timepoints_combinations | Required | List of timepoint combinations to analyze (see below). |
| groups_combinations | Required | List of group pairs to compare (see below). |
See also
For detailed information about statistical tests and score calculations, see Statistical Tests Reference.
Timepoints Configuration:
For longitudinal analysis, specify pairs of timepoints and a focus timepoint:
analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre, post]
      focus: post  # The later/derived timepoint
    - timepoint: [pre, mid]
      focus: mid
Understanding the Focus Timepoint:
The focus timepoint represents the derived or later state in evolutionary comparisons:
For dN/dS analysis: The focus timepoint is treated as the “derived” (Time 2) state, while the other timepoint is “ancestral” (Time 1). AlleleFlux calculates evolutionary changes in the direction: ancestral → derived.
For CMH scores: The score measures differential significance relative to the focus timepoint (sites significant at focus but not at the other timepoint).
Selection guideline: Always choose the later or endpoint timepoint as focus.
Default behavior: If not specified, defaults to the second timepoint in the list.
Examples:
# Typical longitudinal study: Day 0 → Day 30
timepoints_combinations:
  - timepoint: [day0, day30]
    focus: day30  # day30 is derived, day0 is ancestral

# Treatment study: Baseline → Post-treatment
timepoints_combinations:
  - timepoint: [baseline, post_treatment]
    focus: post_treatment  # Post is derived state
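The focus rules above can be restated as a small helper. This is only an illustrative sketch, not AlleleFlux's own code: the function name `resolve_focus` and its return shape are assumptions made for this example, and it encodes just the documented behaviour (the focus is the derived state, and it defaults to the second timepoint when omitted).

```python
from typing import Optional, Sequence, Tuple


def resolve_focus(
    timepoints: Sequence[str], focus: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Return (ancestral, derived) labels for one timepoints_combinations entry.

    Hypothetical helper for illustration; not part of AlleleFlux.
    """
    if len(timepoints) == 1:
        # Single-timepoint analysis: no focus is needed.
        return None, None
    if len(timepoints) != 2:
        raise ValueError("expected one or two timepoints")
    if focus is None:
        # Documented default: the second timepoint in the list.
        focus = timepoints[1]
    if focus not in timepoints:
        raise ValueError(f"focus {focus!r} is not one of {list(timepoints)}")
    # The non-focus timepoint plays the ancestral role.
    ancestral = timepoints[0] if focus == timepoints[1] else timepoints[1]
    return ancestral, focus


if __name__ == "__main__":
    print(resolve_focus(["day0", "day30"]))
    print(resolve_focus(["baseline", "post_treatment"], focus="post_treatment"))
```

Running the helper on each entry before launching the workflow is one way to catch a focus value that does not match either timepoint label.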
For single timepoint analysis:
analysis:
  data_type: single
  timepoints_combinations:
    - timepoint: [baseline]  # No focus needed for single timepoint
Groups Configuration:
Specify pairs of groups to compare:
analysis:
  groups_combinations:
    - [treatment, control]
    - [high_fat, standard]
quality_control
Parameters for filtering samples and positions.
| Parameter | Default | Description |
|---|---|---|
| min_sample_num | | Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible. |
| breadth_threshold | | Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0. |
| coverage_threshold | | Minimum average coverage depth required. Samples below this are excluded. |
| disable_zero_diff_filtering | | If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out. |
Example:
quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false
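To make the interaction between the three thresholds concrete, here is a minimal sketch of a per-MAG eligibility check. It is not AlleleFlux's implementation: the record fields (`group`, `breadth`, `mean_coverage`) and the function names are hypothetical, and treating a value exactly at a threshold as passing is an assumption.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping, Sequence


def group_eligibility(
    samples: Iterable[Mapping[str, Any]],
    groups: Sequence[str],
    min_sample_num: int,
    breadth_threshold: float,
    coverage_threshold: float,
) -> Dict[str, bool]:
    """Report, per group, whether enough samples pass both QC thresholds."""
    passing = Counter(
        s["group"]
        for s in samples
        # Assumption: a sample exactly at a threshold passes.
        if s["breadth"] >= breadth_threshold
        and s["mean_coverage"] >= coverage_threshold
    )
    return {g: passing[g] >= min_sample_num for g in groups}


def mag_is_eligible(per_group: Mapping[str, bool]) -> bool:
    """A MAG is eligible only if every compared group has enough samples."""
    return all(per_group.values())


if __name__ == "__main__":
    samples = [
        {"group": "treatment", "breadth": 0.50, "mean_coverage": 3.0},
        {"group": "treatment", "breadth": 0.30, "mean_coverage": 1.5},
        {"group": "control", "breadth": 0.05, "mean_coverage": 3.0},  # low breadth
        {"group": "control", "breadth": 0.40, "mean_coverage": 2.0},
    ]
    per_group = group_eligibility(samples, ["treatment", "control"], 2, 0.1, 1.0)
    print(per_group, mag_is_eligible(per_group))
```

In this toy input the control group keeps only one sample after filtering, so the MAG would be marked ineligible with `min_sample_num: 2`.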
profiling
Parameters for BAM file processing during profiling.
| Parameter | Default | Description |
|---|---|---|
| ignore_orphans | | Ignore orphan reads (unpaired reads). Set to |
| min_base_quality | 30 | Minimum Phred base quality score to include a base in the pileup. |
| min_mapping_quality | | Minimum mapping quality (MAPQ) score to include a read. |
| ignore_overlaps | | Ignore overlapping segments of read pairs to avoid double-counting. |
Example:
profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true
Note
Higher min_base_quality values (e.g., 30) exclude more low-confidence base calls, which reduces the influence of sequencing errors but can also lower effective coverage. For high-quality data, the default of 30 is recommended.
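For intuition about what a base quality cutoff means, the standard Phred definition relates a score Q to an error probability P by P = 10^(-Q/10). The short sketch below only applies that formula; the function name is chosen for this example and is not an AlleleFlux API.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (20, 30):
        p = phred_to_error_probability(q)
        print(f"Q{q}: about 1 error in {round(1 / p)} base calls")
```

Under this definition, a cutoff of 30 keeps bases with an expected error rate of at most 1 in 1000, while a cutoff of 20 admits bases with up to 1 error in 100.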
statistics
Parameters for statistical testing.
| Parameter | Default | Description |
|---|---|---|
| filter_type | | Type of initial filter for preprocessing positions. |
| preprocess_between_groups | | Enable preprocessing for between-group comparisons. |
| preprocess_within_groups | | Enable preprocessing for within-group comparisons. |
| max_zero_count | | Maximum number of zero-frequency samples allowed per position in preprocessing. |
| p_value_threshold | | Significance threshold (alpha) for statistical tests. |
| fdr_group_by_mag_id | | If true, apply FDR correction within each MAG. If false, apply across all positions. |
| min_positions_after_preprocess | | Minimum number of positions required after preprocessing to proceed with analysis. |
Example:
statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1
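The effect of fdr_group_by_mag_id is easiest to see on a toy example. The sketch below assumes the Benjamini-Hochberg procedure, which this page does not name explicitly, and uses hypothetical function names; it shows how the same p-values receive different q-values depending on whether correction is applied per MAG or across all positions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def adjust_p_values(
    p_values: Sequence[float], mag_ids: Sequence[str], fdr_group_by_mag_id: bool
) -> List[float]:
    """Apply BH across all positions, or separately within each MAG."""
    if not fdr_group_by_mag_id:
        return benjamini_hochberg(p_values)
    indices_by_mag: Dict[str, List[int]] = defaultdict(list)
    for idx, mag in enumerate(mag_ids):
        indices_by_mag[mag].append(idx)
    q_values = [0.0] * len(p_values)
    for indices in indices_by_mag.values():
        adjusted = benjamini_hochberg([p_values[i] for i in indices])
        for idx, q in zip(indices, adjusted):
            q_values[idx] = q
    return q_values


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.2]
    mags = ["MAG_A", "MAG_A", "MAG_B", "MAG_B"]
    print("across all positions:", adjust_p_values(p, mags, False))
    print("within each MAG:     ", adjust_p_values(p, mags, True))
```

Grouping by MAG shrinks the number of tests each p-value is corrected against, so a small MAG is penalized less, while the global setting controls the false discovery rate over the whole dataset.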
dnds
Parameters for dN/dS (non-synonymous/synonymous) ratio calculations.
| Parameter | Default | Description |
|---|---|---|
| p_value_column | | Column name to use for significance in dN/dS calculations. |
| dn_ds_test_type | | Type of statistical test to use for dN/dS analysis. |
Example:
dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest
regional_contrast
Parameters for regional contrast analysis (longitudinal data only). Detects genes or sliding windows where treatment and control groups show consistently different allele-frequency evolution across paired hosts.
Note
Regional contrast analysis is only applicable to longitudinal data (data_type: longitudinal). It operates on raw allele-frequency changes without statistical preprocessing to avoid selection bias.
| Parameter | Default | Description |
|---|---|---|
| mode | both | Region type(s) to analyze: |
| window_size | 1000 | Non-overlapping tile width in base pairs for sliding window analysis. Used when mode is |
| agg_method | median | How to summarize site scores within a region: |
| trim_fraction | | Fraction of values to trim from each tail when using |
| min_informative_sites | 5 | Minimum number of variable sites required per region. Regions with fewer sites are excluded. Set to |
| min_informative_fraction | 0.0 | Minimum fraction of region length that must be covered by informative sites (0.0–1.0). Set to |
| use_fisher | true | Also compute Fisher combined p-values from percentile-derived empirical p-values (secondary/exploratory analysis). Set to |
| use_regional_contrast | true | Enable or disable regional contrast analysis entirely. Set to |
Example with default settings:
regional_contrast:
  mode: both
  window_size: 1000
  agg_method: median
  min_informative_sites: 5
  min_informative_fraction: 0.0
  use_fisher: true
  use_regional_contrast: true
Example with stringent filtering:
regional_contrast:
  mode: both
  window_size: 1000
  agg_method: trimmed_mean
  trim_fraction: 0.15
  min_informative_sites: 10
  min_informative_fraction: 0.5
  use_fisher: false
Understanding the parameters:
mode: gene focuses on annotated genes; window focuses on fixed-size genomic tiles; both runs both analyses in parallel.
window_size: Typical values range from 500 bp (fine-grained) to 5000 bp (coarse-grained). Smaller windows increase power for localized signals; larger windows increase robustness to sparse data.
agg_method: median is recommended for skewed data; trimmed_mean is robust to outliers; mean is simple but sensitive to extreme values.
min_informative_sites: Higher values improve statistical power but reduce the number of analyzable regions. A region with all sites showing site_score == 0 still counts as having full informative sites (no signal ≠ sparse data).
min_informative_fraction: Ensures that only well-sampled regions are tested. A value of 0.5 requires at least 50% of the region to have observed variable sites.
use_fisher: Fisher combined p-values provide an orthogonal statistical perspective but require additional computation. Set to false for large datasets if runtime is a concern.
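The three aggregation choices and the site-count filter can be compared on a small list of site scores. This is a sketch under stated assumptions rather than AlleleFlux's code: the function names are hypothetical, positions are assumed to be 0-based when assigned to tiles, and the number of values trimmed from each tail is assumed to be rounded down.

```python
import statistics
from typing import Optional, Sequence


def trimmed_mean(values: Sequence[float], trim_fraction: float) -> float:
    """Mean after dropping trim_fraction of the values from each tail."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0.0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values dropped per tail (rounded down)
    return statistics.fmean(ordered[k:len(ordered) - k])


def summarize_region(
    site_scores: Sequence[float],
    agg_method: str,
    min_informative_sites: int,
    trim_fraction: float = 0.0,
) -> Optional[float]:
    """Aggregate site scores for one region, or return None if it is excluded."""
    # Sites with a score of 0 still count as informative, as described above.
    if not site_scores or len(site_scores) < min_informative_sites:
        return None
    if agg_method == "median":
        return statistics.median(site_scores)
    if agg_method == "mean":
        return statistics.fmean(site_scores)
    if agg_method == "trimmed_mean":
        return trimmed_mean(site_scores, trim_fraction)
    raise ValueError(f"unknown agg_method: {agg_method!r}")


def window_index(position: int, window_size: int) -> int:
    """Index of the non-overlapping tile containing a 0-based position."""
    return position // window_size


if __name__ == "__main__":
    scores = [0.0, 0.1, 0.2, 0.3, 10.0]  # one extreme site
    for method in ("median", "mean", "trimmed_mean"):
        print(method, summarize_region(scores, method, 5, trim_fraction=0.2))
```

With a single extreme site in the region, the mean is pulled far from the bulk of the scores, while the median and the trimmed mean stay close to it, which is the trade-off the agg_method guidance describes.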
Multiple Group Combinations
AlleleFlux supports running regional contrast (and all other analyses) on multiple group pairs simultaneously. Each pair is analyzed independently:
analysis:
  groups_combinations:
    - treatment: "high_fat"
      control: "control"
    - treatment: "high_fat"
      control: "standard"
For each group combination:
A separate eligibility table is generated based on sample quality metrics for that specific pair
Regional contrast analysis runs independently with separate output directories:
regional_contrast/regional_contrast_{timepoints}-high_fat_control/
regional_contrast/regional_contrast_{timepoints}-high_fat_standard/
Results files are segregated by group combination, preventing cross-contamination
Key note: The treatment and control wildcards in the Snakemake rule are constrained to valid values from your configuration, ensuring that only defined group pairs are processed.
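The directory naming shown above can be reproduced with a few lines of string formatting, which is handy when locating results programmatically. This sketch infers the pattern only from the two example paths on this page; the function name is hypothetical, and because the page does not say how {timepoints} is rendered, the label is passed in as an opaque placeholder string.

```python
from typing import List, Mapping, Sequence


def regional_contrast_dirs(
    timepoints_label: str, groups_combinations: Sequence[Mapping[str, str]]
) -> List[str]:
    """Build one output directory path per treatment/control pair."""
    return [
        # Pattern inferred from the example paths; not confirmed against the workflow.
        f"regional_contrast/regional_contrast_{timepoints_label}-"
        f"{combo['treatment']}_{combo['control']}/"
        for combo in groups_combinations
    ]


if __name__ == "__main__":
    combos = [
        {"treatment": "high_fat", "control": "control"},
        {"treatment": "high_fat", "control": "standard"},
    ]
    # "{timepoints}" is kept literal here because its real format is not documented.
    for path in regional_contrast_dirs("{timepoints}", combos):
        print(path)
```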
resources
Computational resource allocation for cluster execution.
| Parameter | Default | Description |
|---|---|---|
| threads_per_job | | Number of CPU threads allocated to each job. |
| mem_per_job | | Memory allocation per job. Formats: |
| time | | Maximum wall time per job in HH:MM:SS format. |
Example:
resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'
Complete Configuration Example
run_name: diet_microbiome_study

input:
  fasta_path: /data/mags/combined_mags.fasta
  prodigal_path: /data/mags/prodigal_genes.fna
  metadata_path: /data/metadata/samples_with_bam.tsv
  gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /data/mags/mag_mapping.tsv

output:
  root_dir: ./results

log_level: INFO

analysis:
  data_type: longitudinal
  allele_analysis_only: false
  use_lmm: true
  use_significance_tests: true
  use_cmh: true
  timepoints_combinations:
    - timepoint: [pre, post]
      focus: post
  groups_combinations:
    - [high_fat, control]

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true

statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1

dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest

regional_contrast:
  mode: both
  window_size: 1000
  agg_method: median
  min_informative_sites: 5
  min_informative_fraction: 0.0
  use_fisher: true
  use_regional_contrast: true

resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'
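Before launching a long run, it can help to sanity-check a configuration for the entries this page marks as required. The following is a rough pre-flight sketch, not AlleleFlux's own validation: the function name is hypothetical, it works on an already-parsed dictionary, and it checks only the required input paths, the output root, the two required analysis lists, and that any focus value matches one of its timepoints.

```python
from typing import Any, List, Mapping

REQUIRED_INPUT_KEYS = (
    "fasta_path",
    "prodigal_path",
    "metadata_path",
    "gtdb_path",
    "mag_mapping_path",
)


def find_config_problems(config: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    inputs = config.get("input") or {}
    for key in REQUIRED_INPUT_KEYS:
        if not inputs.get(key):
            problems.append(f"input.{key} is missing")
    if not (config.get("output") or {}).get("root_dir"):
        problems.append("output.root_dir is missing")
    analysis = config.get("analysis") or {}
    for key in ("timepoints_combinations", "groups_combinations"):
        if not analysis.get(key):
            problems.append(f"analysis.{key} is required")
    for combo in analysis.get("timepoints_combinations") or []:
        timepoints = combo.get("timepoint") or []
        focus = combo.get("focus")
        if focus is not None and focus not in timepoints:
            problems.append(f"focus {focus!r} is not in timepoint {timepoints}")
    return problems


if __name__ == "__main__":
    config = {
        "input": {key: f"/data/{key}" for key in REQUIRED_INPUT_KEYS},
        "output": {"root_dir": "./results"},
        "analysis": {
            "data_type": "longitudinal",
            "timepoints_combinations": [{"timepoint": ["pre", "post"], "focus": "mid"}],
            "groups_combinations": [["high_fat", "control"]],
        },
    }
    for problem in find_config_problems(config):
        print(problem)
```

The dictionary could come from any YAML parser; for instance, with PyYAML installed, `yaml.safe_load(open("my_config.yml"))` would produce a suitable input.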
Quick Tips
breadth_threshold: Start with 0.1 (at least 10% of the genome covered); increase for high-coverage data
min_sample_num: Minimum 4 samples per group for robust inference
min_base_quality: Keep at 30 for Illumina; lower to 20 for older data
Resource allocation: Adjust threads_per_job and mem_per_job based on MAG sizes
For worked examples, see Use Cases.