Configuration Reference
Complete reference for AlleleFlux configuration options.
Quick Start
Copy the template configuration:
cp alleleflux/smk_workflow/config.template.yml my_config.yml
Edit my_config.yml with your file paths and analysis parameters.
Core Parameters
run_name (optional)
Unique identifier for this analysis.
run_name: my_study_2024
input
Paths to required input files.
| Parameter | Description |
|---|---|
| fasta_path | Path to the combined reference FASTA file containing all MAG contigs. Header format should be |
| prodigal_path | Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA. |
| metadata_path | Path to sample metadata TSV file. Must contain columns: |
| gtdb_path | Path to GTDB-Tk taxonomy file ( |
| mag_mapping_path | Path to contig-to-MAG mapping file (TSV with |
Example:
input:
  fasta_path: /path/to/combined_mags.fasta
  prodigal_path: /path/to/prodigal_genes.fna
  metadata_path: /path/to/sample_metadata.tsv
  gtdb_path: /path/to/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /path/to/mag_mapping.tsv
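Before launching a run, it can help to confirm that every path under input points to an existing file. The following is a minimal sketch and not part of AlleleFlux: the helper name missing_input_paths is hypothetical, and the configuration is assumed to be already loaded into a plain dictionary (for example, by a YAML parser).

```python
import os
import tempfile

# Keys taken from the input section described above.
INPUT_KEYS = (
    "fasta_path",
    "prodigal_path",
    "metadata_path",
    "gtdb_path",
    "mag_mapping_path",
)


def missing_input_paths(config):
    """Return (key, problem) pairs for input entries that are unset or not files.

    Hypothetical helper for illustration; AlleleFlux may validate differently.
    """
    section = config.get("input", {})
    problems = []
    for key in INPUT_KEYS:
        path = section.get(key)
        if path is None:
            problems.append((key, "not set"))
        elif not os.path.isfile(path):
            problems.append((key, f"file not found: {path}"))
    return problems


# Usage: one real temporary file, one bogus path, three keys left unset.
with tempfile.NamedTemporaryFile(suffix=".fasta") as handle:
    example = {"input": {"fasta_path": handle.name, "gtdb_path": "/no/such/file.tsv"}}
    input_problems = missing_input_paths(example)

for key, problem in input_problems:
    print(f"{key}: {problem}")
```

Collecting all problems before reporting, rather than stopping at the first one, lets a single pass surface every path that needs fixing.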
output
Output directory configuration.
| Parameter | Description |
|---|---|
| root_dir | Root directory for all output files. Subdirectories will be created for each analysis step. |
Example:
output:
  root_dir: ./alleleflux_output
analysis
Core analysis settings.
| Parameter | Default | Description |
|---|---|---|
| data_type | | Type of analysis: |
| allele_analysis_only | | If true, only run allele frequency analysis without statistical tests. |
| use_lmm | | Enable Linear Mixed Models (LMM) analysis for longitudinal data. |
| use_significance_tests | | Enable two-sample and single-sample statistical tests. |
| use_cmh | | Enable Cochran-Mantel-Haenszel (CMH) tests. |
| timepoints_combinations | Required | List of timepoint combinations to analyze (see below). |
| groups_combinations | Required | List of group pairs to compare (see below). |
Timepoints Configuration:
For longitudinal analysis, specify pairs of timepoints and a focus timepoint:
analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre, end]
      focus: end
    - timepoint: [pre, mid]
      focus: mid
For single timepoint analysis:
analysis:
  data_type: single
  timepoints_combinations:
    - timepoint: [baseline]
Groups Configuration:
Specify pairs of groups to compare:
analysis:
  groups_combinations:
    - [treatment, control]
    - [high_fat, standard]
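The timepoint and group rules above can be checked mechanically before a run. The sketch below is illustrative only: check_analysis is a hypothetical helper, and the rules it enforces are assumptions inferred from the examples rather than confirmed AlleleFlux behavior. It assumes that only the two data_type values shown above are valid, that longitudinal entries hold exactly two timepoints with a focus drawn from them, that single entries hold exactly one timepoint, and that each group combination is a pair of distinct names.

```python
def check_analysis(analysis):
    """Return a list of human-readable problems found in an analysis section.

    All rules here are assumptions inferred from the documented examples.
    """
    problems = []
    data_type = analysis.get("data_type")
    # Number of timepoints expected per combination for each data type.
    expected = {"longitudinal": 2, "single": 1}.get(data_type)
    if expected is None:
        problems.append(f"unknown data_type: {data_type!r}")

    for i, combo in enumerate(analysis.get("timepoints_combinations", [])):
        timepoints = combo.get("timepoint", [])
        if expected is not None and len(timepoints) != expected:
            problems.append(
                f"timepoints_combinations[{i}]: expected {expected} timepoint(s), "
                f"got {len(timepoints)}"
            )
        if data_type == "longitudinal" and combo.get("focus") not in timepoints:
            problems.append(f"timepoints_combinations[{i}]: focus not in {timepoints}")

    for i, groups in enumerate(analysis.get("groups_combinations", [])):
        if len(groups) != 2 or groups[0] == groups[1]:
            problems.append(f"groups_combinations[{i}]: expected two distinct groups")
    return problems


good = {
    "data_type": "longitudinal",
    "timepoints_combinations": [{"timepoint": ["pre", "end"], "focus": "end"}],
    "groups_combinations": [["treatment", "control"]],
}
bad = {
    "data_type": "longitudinal",
    "timepoints_combinations": [{"timepoint": ["pre", "end"], "focus": "mid"}],
    "groups_combinations": [["treatment", "treatment"]],
}
good_problems = check_analysis(good)
bad_problems = check_analysis(bad)
print(good_problems)
print(bad_problems)
```

The first configuration mirrors the longitudinal example and yields no problems; the second is flagged once for a focus outside its timepoint pair and once for a repeated group name.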
quality_control
Parameters for filtering samples and positions.
| Parameter | Default | Description |
|---|---|---|
| min_sample_num | | Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible. |
| breadth_threshold | | Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0. |
| coverage_threshold | | Minimum average coverage depth required. Samples below this are excluded. |
| disable_zero_diff_filtering | | If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out. |
Example:
quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false
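To make the two coverage thresholds concrete, the sketch below computes breadth and average depth from a toy per-position depth list and applies both cut-offs. It is an illustration under stated assumptions, not AlleleFlux code: the function names are hypothetical, the average is taken over all genome positions (including uncovered ones), and both comparisons are treated as inclusive. The threshold values are the ones from the example above.

```python
def breadth_and_mean_depth(depths):
    """Return (breadth, mean depth) for a list of per-position read depths.

    Breadth is the fraction of positions with depth >= 1.
    """
    if not depths:
        return 0.0, 0.0
    covered = sum(1 for depth in depths if depth >= 1)
    return covered / len(depths), sum(depths) / len(depths)


def passes_qc(depths, breadth_threshold=0.1, coverage_threshold=1.0):
    """Apply both thresholds; inclusive comparison is an assumption."""
    breadth, mean_depth = breadth_and_mean_depth(depths)
    return breadth >= breadth_threshold and mean_depth >= coverage_threshold


# Toy genomes of 10 and 20 positions.
well_covered = [0] * 8 + [12, 10]   # 2 of 10 positions covered
sparse = [0] * 19 + [5]             # 1 of 20 positions covered

for name, depths in (("well_covered", well_covered), ("sparse", sparse)):
    breadth, mean_depth = breadth_and_mean_depth(depths)
    print(name, round(breadth, 3), round(mean_depth, 3), passes_qc(depths))
```

The two thresholds are independent: a sample with a few deeply covered regions can satisfy the depth cut-off while still failing on breadth, and the reverse is also possible.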
profiling
Parameters for BAM file processing during profiling.
| Parameter | Default | Description |
|---|---|---|
| ignore_orphans | | Ignore orphan reads (unpaired reads). Set to |
| min_base_quality | 30 | Minimum Phred base quality score to include a base in the pileup. |
| min_mapping_quality | | Minimum mapping quality (MAPQ) score to include a read. |
| ignore_overlaps | | Ignore overlapping segments of read pairs to avoid double-counting. |
Example:
profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true
Note
Higher min_base_quality values (e.g., 30) exclude more low-confidence base calls, which reduces the influence of sequencing errors but also lowers effective coverage. For high-quality data, the default of 30 is recommended.
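Both quality thresholds are expressed on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). For base quality this is the probability that the base call is wrong; for MAPQ it is the probability that the read is placed at the wrong position. The short sketch below (the function name is illustrative) converts a few scores so the effect of a given threshold is easier to judge.

```python
def phred_to_error_probability(quality):
    """Convert a Phred-scaled score Q to an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


# Scores used in this section: MAPQ 2, and base qualities 20 and 30.
for score in (2, 20, 30):
    probability = phred_to_error_probability(score)
    print(f"Q{score}: error probability {probability:.4f}")
```

On this scale every 10-point increase corresponds to a tenfold drop in error probability, so a base quality cut-off of 30 keeps only calls with an estimated error rate of about 1 in 1000.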
statistics
Parameters for statistical testing.
| Parameter | Default | Description |
|---|---|---|
| filter_type | | Type of initial filter for preprocessing positions. |
| preprocess_between_groups | | Enable preprocessing for between-group comparisons. |
| preprocess_within_groups | | Enable preprocessing for within-group comparisons. |
| max_zero_count | | Maximum number of zero-frequency samples allowed per position in preprocessing. |
| p_value_threshold | | Significance threshold (alpha) for statistical tests. |
| fdr_group_by_mag_id | | If true, apply FDR correction within each MAG. If false, apply across all positions. |
| min_positions_after_preprocess | | Minimum number of positions required after preprocessing to proceed with analysis. |
Example:
statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1
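The effect of fdr_group_by_mag_id is easiest to see on a small example. The sketch below uses the Benjamini-Hochberg procedure as a stand-in, which is an assumption: this reference does not state which FDR method AlleleFlux applies. The function names and the toy records are likewise hypothetical.

```python
from collections import defaultdict


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def adjust(records, group_by_mag_id):
    """Adjust p-values for (mag_id, p_value) records, returned in input order."""
    if not group_by_mag_id:
        return benjamini_hochberg([p for _, p in records])
    positions_by_mag = defaultdict(list)
    for position, (mag_id, _) in enumerate(records):
        positions_by_mag[mag_id].append(position)
    q_values = [0.0] * len(records)
    for positions in positions_by_mag.values():
        group_q = benjamini_hochberg([records[i][1] for i in positions])
        for i, q in zip(positions, group_q):
            q_values[i] = q
    return q_values


records = [("MAG_A", 0.01), ("MAG_A", 0.04), ("MAG_B", 0.03), ("MAG_B", 0.20)]
q_global = adjust(records, group_by_mag_id=False)
q_per_mag = adjust(records, group_by_mag_id=True)
print([round(q, 4) for q in q_global])
print([round(q, 4) for q in q_per_mag])
```

In this toy input, correcting within each MAG leaves both MAG_A positions below a 0.05 threshold, while correcting across all four positions leaves only one. Grouping shrinks the number of tests each correction accounts for, so the choice changes which positions are reported as significant.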
dnds
Parameters for dN/dS (ratio of non-synonymous to synonymous substitution rates) calculations.
| Parameter | Default | Description |
|---|---|---|
| p_value_column | | Column name to use for significance in dN/dS calculations. |
| dn_ds_test_type | | Type of statistical test to use for dN/dS analysis. |
Example:
dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest
resources
Computational resource allocation for cluster execution.
| Parameter | Default | Description |
|---|---|---|
| threads_per_job | | Number of CPU threads allocated to each job. |
| mem_per_job | | Memory allocation per job. Formats: |
| time | | Maximum wall time per job in HH:MM:SS format. |
Example:
resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'
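The time value is quoted in the example above. One likely reason is that some YAML 1.1 parsers read an unquoted 24:00:00 as a base-60 integer instead of a string. For sanity-checking a wall-time value, a small parser such as the following can be used; it is a sketch with a hypothetical function name, and it assumes minutes and seconds must fall between 0 and 59.

```python
def walltime_to_seconds(value):
    """Convert an HH:MM:SS string to a number of seconds.

    Raises ValueError if the string is not three colon-separated integers
    or if minutes/seconds exceed 59 (an assumption of this sketch).
    """
    parts = value.split(":")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected HH:MM:SS, got {value!r}")
    hours, minutes, seconds = (int(part) for part in parts)
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes and seconds must be below 60, got {value!r}")
    return hours * 3600 + minutes * 60 + seconds


day_in_seconds = walltime_to_seconds("24:00:00")
print(day_in_seconds)
```

Hours are left unbounded because wall-time limits longer than one day are common on clusters.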
Complete Configuration Example
run_name: diet_microbiome_study
input:
  fasta_path: /data/mags/combined_mags.fasta
  prodigal_path: /data/mags/prodigal_genes.fna
  metadata_path: /data/metadata/samples_with_bam.tsv
  gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /data/mags/mag_mapping.tsv
output:
  root_dir: ./results
log_level: INFO
analysis:
  data_type: longitudinal
  allele_analysis_only: false
  use_lmm: true
  use_significance_tests: true
  use_cmh: true
  timepoints_combinations:
    - timepoint: [pre, post]
      focus: post
  groups_combinations:
    - [high_fat, control]
quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false
profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true
statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1
dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest
resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'
Quick Tips
- breadth_threshold: Start with 0.1 (10% breadth of coverage); increase for high-coverage data
- min_sample_num: Minimum 4 samples per group for robust inference
- min_base_quality: Keep at 30 for Illumina; lower to 20 for older data
- Resource allocation: Adjust threads_per_job and mem_per_job based on MAG sizes
For worked examples, see Use Cases.