# Configuration Reference Complete reference for AlleleFlux configuration options. ## Quick Start Copy the template configuration: ```bash cp alleleflux/smk_workflow/config.template.yml my_config.yml ``` Edit `my_config.yml` with your file paths and analysis parameters. ## Core Parameters **run_name** (optional) Unique identifier for this analysis. ```yaml run_name: my_study_2024 ``` --- ### input Paths to required input files. | Parameter | Description | |-----------|-------------| | `fasta_path` | Path to the combined reference FASTA file containing all MAG contigs. Header format should be `.fa_`. | | `prodigal_path` | Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA. | | `metadata_path` | Path to sample metadata TSV file. Must contain columns: `sample_id`, `bam_path`, `subjectID`, `group`, `replicate`. For longitudinal data, also include `time`. | | `gtdb_path` | Path to GTDB-Tk taxonomy file (`gtdbtk.bac120.summary.tsv`). Used for taxonomic aggregation of scores. | | `mag_mapping_path` | Path to contig-to-MAG mapping file (TSV with `contig_name` and `mag_id` columns). | **Example:** ```yaml input: fasta_path: /path/to/combined_mags.fasta prodigal_path: /path/to/prodigal_genes.fna metadata_path: /path/to/sample_metadata.tsv gtdb_path: /path/to/gtdbtk.bac120.summary.tsv mag_mapping_path: /path/to/mag_mapping.tsv ``` --- ### output Output directory configuration. | Parameter | Description | |-----------|-------------| | `root_dir` | Root directory for all output files. Subdirectories will be created for each analysis step. | **Example:** ```yaml output: root_dir: ./alleleflux_output ``` --- ### analysis Core analysis settings. | Parameter | Default | Description | |-----------|---------|-------------| | `data_type` | `longitudinal` | Type of analysis: `single` (one timepoint) or `longitudinal` (multiple timepoints). | | `allele_analysis_only` | `false` | If true, only run allele frequency analysis without statistical tests. | | `use_lmm` | `true` | Enable Linear Mixed Models (LMM) analysis for longitudinal data. | | `use_significance_tests` | `true` | Enable two-sample and single-sample statistical tests. | | `use_cmh` | `true` | Enable Cochran-Mantel-Haenszel (CMH) tests. | | `timepoints_combinations` | Required | List of timepoint combinations to analyze (see below). | | `groups_combinations` | Required | List of group pairs to compare (see below). | **Timepoints Configuration:** For longitudinal analysis, specify pairs of timepoints and a focus timepoint: ```yaml analysis: data_type: longitudinal timepoints_combinations: - timepoint: [pre, end] focus: end - timepoint: [pre, mid] focus: mid ``` For single timepoint analysis: ```yaml analysis: data_type: single timepoints_combinations: - timepoint: [baseline] ``` **Groups Configuration:** Specify pairs of groups to compare: ```yaml analysis: groups_combinations: - [treatment, control] - [high_fat, standard] ``` --- ### quality_control Parameters for filtering samples and positions. | Parameter | Default | Description | |-----------|---------|-------------| | `min_sample_num` | `4` | Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible. | | `breadth_threshold` | `0.1` | Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0. | | `coverage_threshold` | `1.0` | Minimum average coverage depth required. Samples below this are excluded. | | `disable_zero_diff_filtering` | `false` | If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out. | **Example:** ```yaml quality_control: min_sample_num: 4 breadth_threshold: 0.1 coverage_threshold: 1.0 disable_zero_diff_filtering: false ``` --- ### profiling Parameters for BAM file processing during profiling. | Parameter | Default | Description | |-----------|---------|-------------| | `ignore_orphans` | `true` | Ignore orphan reads (unpaired reads). Set to `false` to include unpaired reads. | | `min_base_quality` | `30` | Minimum Phred base quality score to include a base in the pileup. | | `min_mapping_quality` | `2` | Minimum mapping quality (MAPQ) score to include a read. | | `ignore_overlaps` | `true` | Ignore overlapping segments of read pairs to avoid double-counting. | **Example:** ```yaml profiling: ignore_orphans: true min_base_quality: 30 min_mapping_quality: 2 ignore_overlaps: true ``` :::{note} Higher `min_base_quality` values (e.g., 30) reduce sequencing errors but may also reduce coverage. For high-quality data, the default of 30 is recommended. ::: --- ### statistics Parameters for statistical testing. | Parameter | Default | Description | |-----------|---------|-------------| | `filter_type` | `t-test` | Type of initial filter for preprocessing positions. | | `preprocess_between_groups` | `true` | Enable preprocessing for between-group comparisons. | | `preprocess_within_groups` | `true` | Enable preprocessing for within-group comparisons. | | `max_zero_count` | `4` | Maximum number of zero-frequency samples allowed per position in preprocessing. | | `p_value_threshold` | `0.05` | Significance threshold (alpha) for statistical tests. | | `fdr_group_by_mag_id` | `false` | If true, apply FDR correction within each MAG. If false, apply across all positions. | | `min_positions_after_preprocess` | `1` | Minimum number of positions required after preprocessing to proceed with analysis. | **Example:** ```yaml statistics: filter_type: t-test preprocess_between_groups: true preprocess_within_groups: true max_zero_count: 4 p_value_threshold: 0.05 fdr_group_by_mag_id: false min_positions_after_preprocess: 1 ``` --- ### dnds Parameters for dN/dS (synonymous/non-synonymous) ratio calculations. | Parameter | Default | Description | |-----------|---------|-------------| | `p_value_column` | `q_value` | Column name to use for significance in dN/dS calculations. | | `dn_ds_test_type` | `two_sample_unpaired_tTest` | Type of statistical test to use for dN/dS analysis. | **Example:** ```yaml dnds: p_value_column: q_value dn_ds_test_type: two_sample_unpaired_tTest ``` --- ### resources Computational resource allocation for cluster execution. | Parameter | Default | Description | |-----------|---------|-------------| | `threads_per_job` | `16` | Number of CPU threads allocated to each job. | | `mem_per_job` | `8G` | Memory allocation per job. Formats: `8G`, `16GB`, `8192M`. | | `time` | `24:00:00` | Maximum wall time per job in HH:MM:SS format. | **Example:** ```yaml resources: threads_per_job: 16 mem_per_job: 8G time: '24:00:00' ``` --- ## Complete Configuration Example ```yaml run_name: diet_microbiome_study input: fasta_path: /data/mags/combined_mags.fasta prodigal_path: /data/mags/prodigal_genes.fna metadata_path: /data/metadata/samples_with_bam.tsv gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv mag_mapping_path: /data/mags/mag_mapping.tsv output: root_dir: ./results log_level: INFO analysis: data_type: longitudinal allele_analysis_only: false use_lmm: true use_significance_tests: true use_cmh: true timepoints_combinations: - timepoint: [pre, post] focus: post groups_combinations: - [high_fat, control] quality_control: min_sample_num: 4 breadth_threshold: 0.1 coverage_threshold: 1.0 disable_zero_diff_filtering: false profiling: ignore_orphans: true min_base_quality: 30 min_mapping_quality: 2 ignore_overlaps: true statistics: filter_type: t-test preprocess_between_groups: true preprocess_within_groups: true max_zero_count: 4 p_value_threshold: 0.05 fdr_group_by_mag_id: false min_positions_after_preprocess: 1 dnds: p_value_column: q_value dn_ds_test_type: two_sample_unpaired_tTest resources: threads_per_job: 16 mem_per_job: 8G time: '24:00:00' ``` ## Quick Tips - **breadth_threshold**: Start with `0.1` (10% coverage); increase for high-coverage data - **min_sample_num**: Minimum `4` samples per group for robust inference - **min_base_quality**: Keep at `30` for Illumina; lower to `20` for older data - **Resource allocation**: Adjust `threads_per_job` and `mem_per_job` based on MAG sizes For worked examples, see [Use Cases](../examples/use_cases.md).