# Tutorial This tutorial uses the bundled synthetic dataset in `docs/source/examples/example_data` to run AlleleFlux end-to-end on a normal desktop computer. No HPC access or GPU is required. --- ## Dataset Overview | Property | Value | |---|---| | MAGs | 2 (`MAG_001`, `MAG_002`) | | Samples | 8 (4 control, 4 treatment; 2 subjects × 2 timepoints each) | | Design | Longitudinal (pre → post) | | Genome size | ~3,600–3,650 positions per MAG (2 contigs each) | A simulated treatment effect is present in the pre-generated profiles: at ~8% of positions in treatment post-timepoint samples, the dominant allele is shifted to an alternative base. This signal is what AlleleFlux is designed to detect. --- ## Prerequisites AlleleFlux installed (see [Installation](../getting_started/installation.md)): ```bash conda activate alleleflux ``` Clone the repository and navigate to the example directory: ```bash git clone https://github.com/MoellerLabPU/AlleleFlux.git cd AlleleFlux/docs/source/examples/example_data ``` --- ## Directory Structure ``` example_data/ ├── config_example.yml # Run from pre-generated profiles (Option A) ├── config_with_bams.yml # Run from BAM files, includes profiling (Option B) ├── generate_bams.sh # Regenerate BAMs from the reference FASTA ├── reference/ │ ├── combined_mags.fasta │ ├── prodigal_genes.fna │ ├── mag_mapping.tsv │ └── gtdbtk_taxonomy.tsv ├── metadata/ │ ├── sample_metadata.tsv # For Option A │ └── sample_metadata_bams.tsv # For Option B ├── profiles/ # Pre-generated profiles with treatment effect │ └── {sample}/{sample}_{MAG}_profiled.tsv.gz ├── bams/ # Synthetic BAM files (~940 KB) │ └── {sample}.bam[.bai] └── significant_sites/ └── significant_sites.tsv ``` --- ## Option A — Pre-generated profiles (recommended) The profiling step is skipped; pre-generated profiles are loaded directly. These profiles were created with a treatment effect baked in (see [Data generation details](#data-generation-details)), making the statistical results meaningful even with a small dataset. > **Caveat:** The profiles are synthetically generated, not derived from real > BAM files. They are designed to demonstrate statistical detection, not to > reflect real sequencing noise. ### Dry-run ```bash alleleflux run --config config_example.yml --dry-run --threads 4 ``` Expected: Snakemake prints 4 initial jobs (metadata, QC, eligibility, all) and notes that checkpoint-based jobs will expand the DAG after running. No files are created. ### Run end-to-end ```bash alleleflux run --config config_example.yml --threads 4 ``` Outputs land under `example_output/longitudinal/`. ### Inspect outputs ```bash OUT=example_output/longitudinal # QC eligibility table — both MAGs pass with 2 replicates/group column -t -s $'\t' $OUT/eligibility_table_pre_post-treatment_control.tsv # Allele frequency changes for MAG_001 zcat $OUT/allele_analysis/allele_analysis_pre_post-treatment_control/MAG_001_allele_frequency_changes_mean.tsv.gz | head -5 # Paired t-test results for MAG_001 zcat $OUT/significance_tests/two_sample_paired_pre_post-treatment_control/MAG_001_two_sample_paired.tsv.gz | head -5 # MAG-level parallel-evolution scores by phylum column -t -s $'\t' \ $OUT/scores/processed/combined/phylum/scores_two_sample_paired-pre_post-treatment_control-phylum.tsv # dN/dS output directories (one per subject pair) ls $OUT/dnds_analysis/pre_post-treatment_control/ ``` --- ## Option B — Full pipeline from BAM files Runs the complete pipeline including the profiling step, starting from the bundled synthetic BAM files. > **Caveat:** The BAMs were generated with `wgsim` using the same parameters for > all samples — there is **no treatment effect** at the read level. Any > statistically significant sites in this run are artefacts of the small n rather > than real biological signal. This option is provided to demonstrate that the > profiling step runs correctly, not to demonstrate statistical detection. > > Additionally, with only 2 subjects per group, the paired longitudinal analysis > yields n=2 paired differences per position. Non-parametric tests (Wilcoxon > signed-rank, Mann-Whitney) require n≥5 to achieve p<0.05, so they will report > zero significant sites. The paired t-test can produce sub-0.05 p-values at n=2 > but those results are statistically unreliable at this sample size. ### Run end-to-end ```bash alleleflux run --config config_with_bams.yml --threads 4 ``` Outputs land under `example_output_bams/longitudinal/`, including a `profiles/` subdirectory with profiles generated from the BAMs. --- ## Expected Output Structure ``` longitudinal/ ├── eligibility_table_pre_post-treatment_control.tsv # Both MAGs: all tests eligible ├── QC/ # Per-sample coverage stats ├── allele_analysis/ # Allele frequency changes per position │ └── allele_analysis_pre_post-treatment_control/ ├── significance_tests/ # Per-position p-values │ ├── two_sample_paired_pre_post-treatment_control/ │ ├── two_sample_unpaired_pre_post-treatment_control/ │ └── single_sample_pre_post-treatment_control/ ├── p_value_summary/ # FDR-corrected q-values ├── scores/ # Parallel-evolution scores by taxonomy │ └── processed/combined/{phylum,genus,MAG}/ ├── outlier_genes/ # Per-gene outlier scores ├── dnds_analysis/ # dN/dS per subject pair │ └── pre_post-treatment_control/{subj1,subj2,subj3,subj4}/ └── preprocessing_eligibility/ # Positions passing preprocessing filters ``` --- ## Expected Run Time Measured on a node pinned to 4 CPU cores: | Option | Config | Wall time | Peak memory | |---|---|---|---| | A — Pre-generated profiles | `config_example.yml` | ~94 s | < 150 MB | | B — Full pipeline from BAMs | `config_with_bams.yml` | ~74 s | < 150 MB | Option B is faster in this demo because the wgsim-derived BAMs produce sparser profiles (not every position achieves uniform coverage), so there is less data in downstream steps. In a real experiment with whole-genome sequencing data, profiling is the most computationally intensive step and will dominate runtime. --- ## Regenerating the BAM Files To rebuild the BAMs from the reference FASTA: ```bash bash generate_bams.sh ``` This uses the `reference/combined_mags.fasta` that is already committed in the repository. `wgsim` and `bwa` must be installed (see above). `wgsim` and `bwa` are **not** included in the `alleleflux` conda environment. Install them if needed: ```bash conda install -c bioconda bwa wgsim ``` ## Data Generation Details The pre-generated profiles were created with [`generate_synthetic_data.py`](generate_synthetic_data.py) (seed = 42): - Each MAG has 2 contigs (~800–2,000 bp, 3 genes per contig) - Coverage drawn uniformly from 20–80× at every position - The dominant allele receives 85–98% of reads; the remainder are split randomly among the other three bases **Simulated treatment effect** (lines 364–366 of `generate_synthetic_data.py`): ```python if is_treatment and is_post and random.random() < treatment_effect_rate: dominant_base = BASES[(pos % 4 + 1) % 4] ``` In treatment post-timepoint samples only, each position has an 8% probability of shifting its dominant allele to the next base in the cycle (e.g., A → T). This creates real but modest allele frequency differences between groups that AlleleFlux is designed to detect. The BAMs were generated with `wgsim` (100 bp paired-end, 0.5% error rate, unique seed per sample) and aligned with `bwa mem`. All samples were generated from the same reference with the same parameters, so no treatment effect is present at the read level. --- ## File Format Reference ### Profile files (`*_profiled.tsv.gz`) | Column | Type | Description | |---|---|---| | `contig` | string | Contig identifier | | `position` | int | 0-based genomic position | | `ref_base` | string | Reference base (A/C/G/T) | | `total_coverage` | int | Total read depth | | `A`, `C`, `G`, `T` | int | Per-base read counts | | `N` | int | Ambiguous base count | | `gene_id` | string | Overlapping gene ID (empty if intergenic) | ### Sample metadata | Column | Description | |---|---| | `sample_id` | Unique sample identifier | | `bam_path` | Path to BAM file or pre-generated profile directory | | `subjectID` | Biological replicate / subject ID | | `group` | Experimental group (`control` / `treatment`) | | `replicate` | Replicate letter within group | | `time` | Timepoint (`pre` / `post`) |