# Configuration Reference Complete reference for AlleleFlux configuration options. ## Quick Start Copy the template configuration: ```bash cp alleleflux/smk_workflow/config.template.yml my_config.yml ``` Edit `my_config.yml` with your file paths and analysis parameters. ## Core Parameters **run_name** (optional) Unique identifier for this analysis. ```yaml run_name: my_study_2024 ``` --- ### input Paths to required input files. | Parameter | Description | |-----------|-------------| | `fasta_path` | Path to the combined reference FASTA file containing all MAG contigs. Header format should be `.fa_`. | | `prodigal_path` | Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA. | | `metadata_path` | Path to sample metadata TSV file. Must contain columns: `sample_id`, `bam_path`, `subjectID`, `group`, `replicate`. For longitudinal data, also include `time`. | | `gtdb_path` | Path to GTDB-Tk taxonomy file (`gtdbtk.bac120.summary.tsv`). Used for taxonomic aggregation of scores. | | `mag_mapping_path` | Path to contig-to-MAG mapping file (TSV with `contig_name` and `mag_id` columns). | **Example:** ```yaml input: fasta_path: /path/to/combined_mags.fasta prodigal_path: /path/to/prodigal_genes.fna metadata_path: /path/to/sample_metadata.tsv gtdb_path: /path/to/gtdbtk.bac120.summary.tsv mag_mapping_path: /path/to/mag_mapping.tsv ``` --- ### output Output directory configuration. | Parameter | Description | |-----------|-------------| | `root_dir` | Root directory for all output files. Subdirectories will be created for each analysis step. | **Example:** ```yaml output: root_dir: ./alleleflux_output ``` --- ### analysis Core analysis settings. | Parameter | Default | Description | |-----------|---------|-------------| | `data_type` | `longitudinal` | Type of analysis: `single` (one timepoint) or `longitudinal` (multiple timepoints). | | `allele_analysis_only` | `false` | If true, only run allele frequency analysis without statistical tests. | | `use_lmm` | `true` | Enable Linear Mixed Models (LMM) for repeated measures/longitudinal data. Best for accounting for subject-level variation. | | `use_significance_tests` | `true` | Enable two-sample (t-test, Mann-Whitney) and single-sample statistical tests. Best for simple comparisons. | | `use_cmh` | `true` | Enable Cochran-Mantel-Haenszel tests for stratified categorical analysis. Best for detecting consistent directional changes. | | `timepoints_combinations` | Required | List of timepoint combinations to analyze (see below). | | `groups_combinations` | Required | List of group pairs to compare (see below). | :::{seealso} For detailed information about statistical tests and score calculations, see [Statistical Tests Reference](statistical_tests.md). ::: **Timepoints Configuration:** For longitudinal analysis, specify pairs of timepoints and a **focus timepoint**: ```yaml analysis: data_type: longitudinal timepoints_combinations: - timepoint: [pre, post] focus: post # The later/derived timepoint - timepoint: [pre, mid] focus: mid ``` **Understanding the Focus Timepoint:** The focus timepoint represents the **derived** or **later** state in evolutionary comparisons: - **For dN/dS analysis**: The focus timepoint is treated as the "derived" (Time 2) state, while the other timepoint is "ancestral" (Time 1). AlleleFlux calculates evolutionary changes in the direction: ancestral → derived. - **For CMH scores**: The score measures differential significance relative to the focus timepoint (sites significant at focus but not at the other timepoint). - **Selection guideline**: Always choose the **later** or **endpoint** timepoint as focus. - **Default behavior**: If not specified, defaults to the second timepoint in the list. **Examples:** ```yaml # Typical longitudinal study: Day 0 → Day 30 timepoints_combinations: - timepoint: [day0, day30] focus: day30 # day30 is derived, day0 is ancestral # Treatment study: Baseline → Post-treatment timepoints_combinations: - timepoint: [baseline, post_treatment] focus: post_treatment # Post is derived state ``` For single timepoint analysis: ```yaml analysis: data_type: single timepoints_combinations: - timepoint: [baseline] # No focus needed for single timepoint ``` **Groups Configuration:** Specify pairs of groups to compare: ```yaml analysis: groups_combinations: - [treatment, control] - [high_fat, standard] ``` --- ### quality_control Parameters for filtering samples and positions. | Parameter | Default | Description | |-----------|---------|-------------| | `min_sample_num` | `4` | Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible. | | `breadth_threshold` | `0.1` | Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0. | | `coverage_threshold` | `1.0` | Minimum average coverage depth required. Samples below this are excluded. | | `disable_zero_diff_filtering` | `false` | If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out. | **Example:** ```yaml quality_control: min_sample_num: 4 breadth_threshold: 0.1 coverage_threshold: 1.0 disable_zero_diff_filtering: false ``` --- ### profiling Parameters for BAM file processing during profiling. | Parameter | Default | Description | |-----------|---------|-------------| | `ignore_orphans` | `true` | Ignore orphan reads (unpaired reads). Set to `false` to include unpaired reads. | | `min_base_quality` | `30` | Minimum Phred base quality score to include a base in the pileup. | | `min_mapping_quality` | `2` | Minimum mapping quality (MAPQ) score to include a read. | | `ignore_overlaps` | `true` | Ignore overlapping segments of read pairs to avoid double-counting. | **Example:** ```yaml profiling: ignore_orphans: true min_base_quality: 30 min_mapping_quality: 2 ignore_overlaps: true ``` :::{note} Higher `min_base_quality` values (e.g., 30) reduce sequencing errors but may also reduce coverage. For high-quality data, the default of 30 is recommended. ::: --- ### statistics Parameters for statistical testing. | Parameter | Default | Description | |-----------|---------|-------------| | `filter_type` | `t-test` | Type of initial filter for preprocessing positions. | | `preprocess_between_groups` | `true` | Enable preprocessing for between-group comparisons. | | `preprocess_within_groups` | `true` | Enable preprocessing for within-group comparisons. | | `max_zero_count` | `4` | Maximum number of zero-frequency samples allowed per position in preprocessing. | | `p_value_threshold` | `0.05` | Significance threshold (alpha) for statistical tests. | | `fdr_group_by_mag_id` | `false` | If true, apply FDR correction within each MAG. If false, apply across all positions. | | `min_positions_after_preprocess` | `1` | Minimum number of positions required after preprocessing to proceed with analysis. | **Example:** ```yaml statistics: filter_type: t-test preprocess_between_groups: true preprocess_within_groups: true max_zero_count: 4 p_value_threshold: 0.05 fdr_group_by_mag_id: false min_positions_after_preprocess: 1 ``` --- ### dnds Parameters for dN/dS (synonymous/non-synonymous) ratio calculations. | Parameter | Default | Description | |-----------|---------|-------------| | `p_value_column` | `q_value` | Column name to use for significance in dN/dS calculations. | | `dn_ds_test_type` | `two_sample_unpaired_tTest` | Type of statistical test to use for dN/dS analysis. | **Example:** ```yaml dnds: p_value_column: q_value dn_ds_test_type: two_sample_unpaired_tTest ``` --- ### regional_contrast Parameters for regional contrast analysis (longitudinal data only). Detects genes or sliding windows where treatment and control groups show consistently different allele-frequency evolution across paired hosts. :::{note} Regional contrast analysis is **only applicable to longitudinal data** (`data_type: longitudinal`). It operates on raw allele-frequency changes without statistical preprocessing to avoid selection bias. ::: | Parameter | Default | Description | |-----------|---------|-------------| | `mode` | `both` | Region type(s) to analyze: `gene` (gene annotations), `window` (sliding windows), or `both`. | | `window_size` | `1000` | Non-overlapping tile width in base pairs for sliding window analysis. Used when mode is `window` or `both`. | | `agg_method` | `median` | How to summarize site scores within a region: `median` (robust), `mean` (simple average), or `trimmed_mean` (robust mean with custom tail trimming). | | `trim_fraction` | `0.1` | Fraction of values to trim from each tail when using `agg_method: trimmed_mean`. Ignored for other aggregation methods. | | `min_informative_sites` | `5` | Minimum number of variable sites required per region. Regions with fewer sites are excluded. Set to `0` to disable. **Note:** Sites with `site_score == 0` (perfect evolutionary stasis) are counted if they exist in the input. | | `min_informative_fraction` | `0.0` | Minimum fraction of region length that must be covered by informative sites (0.0–1.0). Set to `0.0` to disable. | | `use_fisher` | `true` | Also compute Fisher combined p-values from percentile-derived empirical p-values (secondary/exploratory analysis). Set to `false` to skip this computationally intensive step. | | `use_regional_contrast` | `true` | Enable or disable regional contrast analysis entirely. Set to `false` to skip this analysis. | **Example with default settings:** ```yaml regional_contrast: mode: both window_size: 1000 agg_method: median min_informative_sites: 5 min_informative_fraction: 0.0 use_fisher: true use_regional_contrast: true ``` **Example with stringent filtering:** ```yaml regional_contrast: mode: both window_size: 1000 agg_method: trimmed_mean trim_fraction: 0.15 min_informative_sites: 10 min_informative_fraction: 0.5 use_fisher: false ``` **Understanding the parameters:** - **mode**: `gene` focuses on annotated genes; `window` focuses on fixed-size genomic tiles; `both` runs both analyses in parallel. - **window_size**: Typical values range from 500 bp (fine-grained) to 5000 bp (coarse-grained). Smaller windows increase power for localized signals; larger windows increase robustness to sparse data. - **agg_method**: `median` is recommended for skewed data; `trimmed_mean` is robust to outliers; `mean` is simple but sensitive to extreme values. - **min_informative_sites**: Higher values improve statistical power but reduce the number of analyzable regions. A region with all sites showing `site_score == 0` still counts as having full informative sites (no signal ≠ sparse data). - **min_informative_fraction**: Ensures that only well-sampled regions are tested. A value of 0.5 requires at least 50% of the region to have observed variable sites. - **use_fisher**: Fisher combined p-values provide an orthogonal statistical perspective but require additional computation. Set to `false` for large datasets if runtime is a concern. --- ### Multiple Group Combinations AlleleFlux supports running regional contrast (and all other analyses) on **multiple group pairs simultaneously**. Each pair is analyzed independently: ```yaml analysis: groups_combinations: - treatment: "high_fat" control: "control" - treatment: "high_fat" control: "standard" ``` For each group combination: - A separate eligibility table is generated based on sample quality metrics for that specific pair - Regional contrast analysis runs independently with separate output directories: - `regional_contrast/regional_contrast_{timepoints}-high_fat_control/` - `regional_contrast/regional_contrast_{timepoints}-high_fat_standard/` - Results files are segregated by group combination, preventing cross-contamination **Key note:** The `treatment` and `control` wildcards in the Snakemake rule are constrained to valid values from your configuration, ensuring that only defined group pairs are processed. --- ### resources Computational resource allocation for cluster execution. | Parameter | Default | Description | |-----------|---------|-------------| | `threads_per_job` | `16` | Number of CPU threads allocated to each job. | | `mem_per_job` | `8G` | Memory allocation per job. Formats: `8G`, `16GB`, `8192M`. | | `time` | `24:00:00` | Maximum wall time per job in HH:MM:SS format. | **Example:** ```yaml resources: threads_per_job: 16 mem_per_job: 8G time: '24:00:00' ``` --- ## Complete Configuration Example ```yaml run_name: diet_microbiome_study input: fasta_path: /data/mags/combined_mags.fasta prodigal_path: /data/mags/prodigal_genes.fna metadata_path: /data/metadata/samples_with_bam.tsv gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv mag_mapping_path: /data/mags/mag_mapping.tsv output: root_dir: ./results log_level: INFO analysis: data_type: longitudinal allele_analysis_only: false use_lmm: true use_significance_tests: true use_cmh: true timepoints_combinations: - timepoint: [pre, post] focus: post groups_combinations: - [high_fat, control] quality_control: min_sample_num: 4 breadth_threshold: 0.1 coverage_threshold: 1.0 disable_zero_diff_filtering: false profiling: ignore_orphans: true min_base_quality: 30 min_mapping_quality: 2 ignore_overlaps: true statistics: filter_type: t-test preprocess_between_groups: true preprocess_within_groups: true max_zero_count: 4 p_value_threshold: 0.05 fdr_group_by_mag_id: false min_positions_after_preprocess: 1 dnds: p_value_column: q_value dn_ds_test_type: two_sample_unpaired_tTest regional_contrast: mode: both window_size: 1000 agg_method: median min_informative_sites: 5 min_informative_fraction: 0.0 use_fisher: true use_regional_contrast: true resources: threads_per_job: 16 mem_per_job: 8G time: '24:00:00' ``` ## Quick Tips - **breadth_threshold**: Start with `0.1` (10% coverage); increase for high-coverage data - **min_sample_num**: Minimum `4` samples per group for robust inference - **min_base_quality**: Keep at `30` for Illumina; lower to `20` for older data - **Resource allocation**: Adjust `threads_per_job` and `mem_per_job` based on MAG sizes For worked examples, see [Use Cases](../examples/use_cases.md).