Configuration Reference¶

Complete reference for AlleleFlux configuration options.

Quick Start¶

Copy the template configuration:

cp alleleflux/smk_workflow/config.template.yml my_config.yml

Edit my_config.yml with your file paths and analysis parameters.

Core Parameters¶

run_name (optional)

Unique identifier for this analysis.

run_name: my_study_2024

input¶

Paths to required input files.

Parameter	Description
`fasta_path`	Path to the combined reference FASTA file containing all MAG contigs. Header format should be `<MAG_ID>.fa_<contig_ID>`.
`prodigal_path`	Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA.
`metadata_path`	Path to sample metadata TSV file. Must contain columns: `sample_id`, `bam_path`, `subjectID`, `group`, `replicate`. For longitudinal data, also include `time`.
`gtdb_path`	Path to GTDB-Tk taxonomy file (`gtdbtk.bac120.summary.tsv`). Used for taxonomic aggregation of scores.
`mag_mapping_path`	Path to contig-to-MAG mapping file (TSV with `contig_name` and `mag_id` columns).

Example:

input:
  fasta_path: /path/to/combined_mags.fasta
  prodigal_path: /path/to/prodigal_genes.fna
  metadata_path: /path/to/sample_metadata.tsv
  gtdb_path: /path/to/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /path/to/mag_mapping.tsv

output¶

Output directory configuration.

Parameter	Description
`root_dir`	Root directory for all output files. Subdirectories will be created for each analysis step.

Example:

output:
  root_dir: ./alleleflux_output

analysis¶

Core analysis settings.

Parameter	Default	Description
`data_type`	`longitudinal`	Type of analysis: `single` (one timepoint) or `longitudinal` (multiple timepoints).
`allele_analysis_only`	`false`	If true, only run allele frequency analysis without statistical tests.
`use_lmm`	`true`	Enable Linear Mixed Models (LMM) for repeated measures/longitudinal data. Best for accounting for subject-level variation.
`use_significance_tests`	`true`	Enable two-sample (t-test, Mann-Whitney) and single-sample statistical tests. Best for simple comparisons.
`use_cmh`	`true`	Enable Cochran-Mantel-Haenszel tests for stratified categorical analysis. Best for detecting consistent directional changes.
`timepoints_combinations`	Required	List of timepoint combinations to analyze (see below).
`groups_combinations`	Required	List of group pairs to compare (see below).

quality_control¶

Parameters for filtering samples and positions.

Parameter	Default	Description
`min_sample_num`	`4`	Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible.
`breadth_threshold`	`0.1`	Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0.
`coverage_threshold`	`1.0`	Minimum average coverage depth required. Samples below this are excluded.
`disable_zero_diff_filtering`	`false`	If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out.

Example:

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

profiling¶

Parameters for BAM file processing during profiling.

Parameter	Default	Description
`ignore_orphans`	`true`	Ignore orphan reads (unpaired reads). Set to `false` to include unpaired reads.
`min_base_quality`	`30`	Minimum Phred base quality score to include a base in the pileup.
`min_mapping_quality`	`2`	Minimum mapping quality (MAPQ) score to include a read.
`ignore_overlaps`	`true`	Ignore overlapping segments of read pairs to avoid double-counting.

Example:

profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true

Note

Higher min_base_quality values (e.g., 30) reduce sequencing errors but may also reduce coverage. For high-quality data, the default of 30 is recommended.

statistics¶

Parameters for statistical testing.

Parameter	Default	Description
`filter_type`	`t-test`	Type of initial filter for preprocessing positions.
`preprocess_between_groups`	`true`	Enable preprocessing for between-group comparisons.
`preprocess_within_groups`	`true`	Enable preprocessing for within-group comparisons.
`max_zero_count`	`4`	Maximum number of zero-frequency samples allowed per position in preprocessing.
`p_value_threshold`	`0.05`	Significance threshold (alpha) for statistical tests.
`fdr_group_by_mag_id`	`false`	If true, apply FDR correction within each MAG. If false, apply across all positions.
`min_positions_after_preprocess`	`1`	Minimum number of positions required after preprocessing to proceed with analysis.

Example:

statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1

dnds¶

Parameters for dN/dS (synonymous/non-synonymous) ratio calculations.

Parameter	Default	Description
`p_value_column`	`q_value`	Column name to use for significance in dN/dS calculations.
`dn_ds_test_type`	`two_sample_unpaired_tTest`	Type of statistical test to use for dN/dS analysis.

Example:

dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest

regional_contrast¶

Parameters for regional contrast analysis (longitudinal data only). Detects genes or sliding windows where treatment and control groups show consistently different allele-frequency evolution across paired hosts.

Note

Regional contrast analysis is only applicable to longitudinal data (data_type: longitudinal). It operates on raw allele-frequency changes without statistical preprocessing to avoid selection bias.

Parameter	Default	Description
`mode`	`both`	Region type(s) to analyze: `gene` (gene annotations), `window` (sliding windows), or `both`.
`window_size`	`1000`	Non-overlapping tile width in base pairs for sliding window analysis. Used when mode is `window` or `both`.
`agg_method`	`median`	How to summarize site scores within a region: `median` (robust), `mean` (simple average), or `trimmed_mean` (robust mean with custom tail trimming).
`trim_fraction`	`0.1`	Fraction of values to trim from each tail when using `agg_method: trimmed_mean`. Ignored for other aggregation methods.
`min_informative_sites`	`5`	Minimum number of variable sites required per region. Regions with fewer sites are excluded. Set to `0` to disable. Note: Sites with `site_score == 0` (perfect evolutionary stasis) are counted if they exist in the input.
`min_informative_fraction`	`0.0`	Minimum fraction of region length that must be covered by informative sites (0.0–1.0). Set to `0.0` to disable.
`use_fisher`	`true`	Also compute Fisher combined p-values from percentile-derived empirical p-values (secondary/exploratory analysis). Set to `false` to skip this computationally intensive step.
`use_regional_contrast`	`true`	Enable or disable regional contrast analysis entirely. Set to `false` to skip this analysis.

Example with default settings:

regional_contrast:
  mode: both
  window_size: 1000
  agg_method: median
  min_informative_sites: 5
  min_informative_fraction: 0.0
  use_fisher: true
  use_regional_contrast: true

Example with stringent filtering:

regional_contrast:
  mode: both
  window_size: 1000
  agg_method: trimmed_mean
  trim_fraction: 0.15
  min_informative_sites: 10
  min_informative_fraction: 0.5
  use_fisher: false

Understanding the parameters:

mode: gene focuses on annotated genes; window focuses on fixed-size genomic tiles; both runs both analyses in parallel.
window_size: Typical values range from 500 bp (fine-grained) to 5000 bp (coarse-grained). Smaller windows increase power for localized signals; larger windows increase robustness to sparse data.
agg_method: median is recommended for skewed data; trimmed_mean is robust to outliers; mean is simple but sensitive to extreme values.
min_informative_sites: Higher values improve statistical power but reduce the number of analyzable regions. A region with all sites showing site_score == 0 still counts as having full informative sites (no signal ≠ sparse data).
min_informative_fraction: Ensures that only well-sampled regions are tested. A value of 0.5 requires at least 50% of the region to have observed variable sites.
use_fisher: Fisher combined p-values provide an orthogonal statistical perspective but require additional computation. Set to false for large datasets if runtime is a concern.

Multiple Group Combinations¶

AlleleFlux supports running regional contrast (and all other analyses) on multiple group pairs simultaneously. Each pair is analyzed independently:

analysis:
  groups_combinations:
    - treatment: "high_fat"
      control: "control"
    - treatment: "high_fat"
      control: "standard"

For each group combination:

A separate eligibility table is generated based on sample quality metrics for that specific pair
Regional contrast analysis runs independently with separate output directories:
- regional_contrast/regional_contrast_{timepoints}-high_fat_control/
- regional_contrast/regional_contrast_{timepoints}-high_fat_standard/
Results files are segregated by group combination, preventing cross-contamination

Key note: The treatment and control wildcards in the Snakemake rule are constrained to valid values from your configuration, ensuring that only defined group pairs are processed.

resources¶

Computational resource allocation for cluster execution.

Parameter	Default	Description
`threads_per_job`	`16`	Number of CPU threads allocated to each job.
`mem_per_job`	`8G`	Memory allocation per job. Formats: `8G`, `16GB`, `8192M`.
`time`	`24:00:00`	Maximum wall time per job in HH:MM:SS format.

Example:

resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'

Complete Configuration Example¶

run_name: diet_microbiome_study

input:
  fasta_path: /data/mags/combined_mags.fasta
  prodigal_path: /data/mags/prodigal_genes.fna
  metadata_path: /data/metadata/samples_with_bam.tsv
  gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /data/mags/mag_mapping.tsv

output:
  root_dir: ./results

log_level: INFO

analysis:
  data_type: longitudinal
  allele_analysis_only: false
  use_lmm: true
  use_significance_tests: true
  use_cmh: true
  timepoints_combinations:
    - timepoint: [pre, post]
      focus: post
  groups_combinations:
    - [high_fat, control]

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true

statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1

dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest

regional_contrast:
  mode: both
  window_size: 1000
  agg_method: median
  min_informative_sites: 5
  min_informative_fraction: 0.0
  use_fisher: true
  use_regional_contrast: true

resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'

Quick Tips¶

breadth_threshold: Start with 0.1 (10% coverage); increase for high-coverage data
min_sample_num: Minimum 4 samples per group for robust inference
min_base_quality: Keep at 30 for Illumina; lower to 20 for older data
Resource allocation: Adjust threads_per_job and mem_per_job based on MAG sizes

For worked examples, see Use Cases.