Configuration Reference

Complete reference for AlleleFlux configuration options.

Quick Start

Copy the template configuration:

cp alleleflux/smk_workflow/config.template.yml my_config.yml

Edit my_config.yml with your file paths and analysis parameters.

Core Parameters

run_name (optional)

Unique identifier for this analysis.

run_name: my_study_2024

input

Paths to required input files.

Parameter

Description

fasta_path

Path to the combined reference FASTA file containing all MAG contigs. Header format should be <MAG_ID>.fa_<contig_ID>.

prodigal_path

Path to Prodigal gene predictions (nucleotide FASTA). Gene IDs must match contig IDs in the reference FASTA.

metadata_path

Path to sample metadata TSV file. Must contain columns: sample_id, bam_path, subjectID, group, replicate. For longitudinal data, also include time.

gtdb_path

Path to GTDB-Tk taxonomy file (gtdbtk.bac120.summary.tsv). Used for taxonomic aggregation of scores.

mag_mapping_path

Path to contig-to-MAG mapping file (TSV with contig_name and mag_id columns).

Example:

input:
  fasta_path: /path/to/combined_mags.fasta
  prodigal_path: /path/to/prodigal_genes.fna
  metadata_path: /path/to/sample_metadata.tsv
  gtdb_path: /path/to/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /path/to/mag_mapping.tsv

output

Output directory configuration.

Parameter

Description

root_dir

Root directory for all output files. Subdirectories will be created for each analysis step.

Example:

output:
  root_dir: ./alleleflux_output

analysis

Core analysis settings.

Parameter

Default

Description

data_type

longitudinal

Type of analysis: single (one timepoint) or longitudinal (multiple timepoints).

allele_analysis_only

false

If true, only run allele frequency analysis without statistical tests.

use_lmm

true

Enable Linear Mixed Models (LMM) analysis for longitudinal data.

use_significance_tests

true

Enable two-sample and single-sample statistical tests.

use_cmh

true

Enable Cochran-Mantel-Haenszel (CMH) tests.

timepoints_combinations

Required

List of timepoint combinations to analyze (see below).

groups_combinations

Required

List of group pairs to compare (see below).

Timepoints Configuration:

For longitudinal analysis, specify pairs of timepoints and a focus timepoint:

analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: [pre, end]
      focus: end
    - timepoint: [pre, mid]
      focus: mid

For single timepoint analysis:

analysis:
  data_type: single
  timepoints_combinations:
    - timepoint: [baseline]

Groups Configuration:

Specify pairs of groups to compare:

analysis:
  groups_combinations:
    - [treatment, control]
    - [high_fat, standard]

quality_control

Parameters for filtering samples and positions.

Parameter

Default

Description

min_sample_num

4

Minimum number of samples per group required for statistical tests. MAGs with fewer valid samples are marked as ineligible.

breadth_threshold

0.1

Minimum breadth of coverage (fraction of genome with ≥1x coverage). Range: 0.0-1.0.

coverage_threshold

1.0

Minimum average coverage depth required. Samples below this are excluded.

disable_zero_diff_filtering

false

If true, keep positions where allele frequencies do not change. By default, constant positions are filtered out.

Example:

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

profiling

Parameters for BAM file processing during profiling.

Parameter

Default

Description

ignore_orphans

true

Ignore orphan reads (unpaired reads). Set to false to include unpaired reads.

min_base_quality

30

Minimum Phred base quality score to include a base in the pileup.

min_mapping_quality

2

Minimum mapping quality (MAPQ) score to include a read.

ignore_overlaps

true

Ignore overlapping segments of read pairs to avoid double-counting.

Example:

profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true

Note

Higher min_base_quality values (e.g., 30) reduce sequencing errors but may also reduce coverage. For high-quality data, the default of 30 is recommended.


statistics

Parameters for statistical testing.

Parameter

Default

Description

filter_type

t-test

Type of initial filter for preprocessing positions.

preprocess_between_groups

true

Enable preprocessing for between-group comparisons.

preprocess_within_groups

true

Enable preprocessing for within-group comparisons.

max_zero_count

4

Maximum number of zero-frequency samples allowed per position in preprocessing.

p_value_threshold

0.05

Significance threshold (alpha) for statistical tests.

fdr_group_by_mag_id

false

If true, apply FDR correction within each MAG. If false, apply across all positions.

min_positions_after_preprocess

1

Minimum number of positions required after preprocessing to proceed with analysis.

Example:

statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1

dnds

Parameters for dN/dS (synonymous/non-synonymous) ratio calculations.

Parameter

Default

Description

p_value_column

q_value

Column name to use for significance in dN/dS calculations.

dn_ds_test_type

two_sample_unpaired_tTest

Type of statistical test to use for dN/dS analysis.

Example:

dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest

resources

Computational resource allocation for cluster execution.

Parameter

Default

Description

threads_per_job

16

Number of CPU threads allocated to each job.

mem_per_job

8G

Memory allocation per job. Formats: 8G, 16GB, 8192M.

time

24:00:00

Maximum wall time per job in HH:MM:SS format.

Example:

resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'

Complete Configuration Example

run_name: diet_microbiome_study

input:
  fasta_path: /data/mags/combined_mags.fasta
  prodigal_path: /data/mags/prodigal_genes.fna
  metadata_path: /data/metadata/samples_with_bam.tsv
  gtdb_path: /data/taxonomy/gtdbtk.bac120.summary.tsv
  mag_mapping_path: /data/mags/mag_mapping.tsv

output:
  root_dir: ./results

log_level: INFO

analysis:
  data_type: longitudinal
  allele_analysis_only: false
  use_lmm: true
  use_significance_tests: true
  use_cmh: true
  timepoints_combinations:
    - timepoint: [pre, post]
      focus: post
  groups_combinations:
    - [high_fat, control]

quality_control:
  min_sample_num: 4
  breadth_threshold: 0.1
  coverage_threshold: 1.0
  disable_zero_diff_filtering: false

profiling:
  ignore_orphans: true
  min_base_quality: 30
  min_mapping_quality: 2
  ignore_overlaps: true

statistics:
  filter_type: t-test
  preprocess_between_groups: true
  preprocess_within_groups: true
  max_zero_count: 4
  p_value_threshold: 0.05
  fdr_group_by_mag_id: false
  min_positions_after_preprocess: 1

dnds:
  p_value_column: q_value
  dn_ds_test_type: two_sample_unpaired_tTest

resources:
  threads_per_job: 16
  mem_per_job: 8G
  time: '24:00:00'

Quick Tips

  • breadth_threshold: Start with 0.1 (10% coverage); increase for high-coverage data

  • min_sample_num: Minimum 4 samples per group for robust inference

  • min_base_quality: Keep at 30 for Illumina; lower to 20 for older data

  • Resource allocation: Adjust threads_per_job and mem_per_job based on MAG sizes

For worked examples, see Use Cases.