# Tutorial This walkthrough uses the bundled example dataset in `docs/source/examples/example_data` plus the ready-made `config_example.yml`. It shows how to dry-run, execute, and inspect the outputs for a single MAG/timepoint/group. Micro-examples at the end exercise individual tools on the tiny mock datasets in `tests/evolution/mock_data`. ## Prerequisites - AlleleFlux installed and on `$PATH` (see [Installation](../getting_started/installation.md)). - Working from the repo root (`AlleleFlux/`) so relative paths resolve. ## Step 1: Start from the template config Option A: copy the example config (pre-populated for the bundled data): ```bash cp docs/source/examples/example_data/config_example.yml ./config_example.yml ``` Option B: print the template then edit: ```bash alleleflux init --template > config_example.yml ``` Open `config_example.yml` and confirm the paths point at `docs/source/examples/example_data`. The important bits (already set in the example): ```yaml input: fasta_path: docs/source/examples/example_data/reference/combined_mags.fasta prodigal_path: docs/source/examples/example_data/reference/prodigal_genes.fna metadata_path: docs/source/examples/example_data/metadata/sample_metadata.tsv gtdb_path: docs/source/examples/example_data/reference/gtdbtk_taxonomy.tsv mag_mapping_path: docs/source/examples/example_data/reference/mag_mapping.tsv output: root_dir: ./example_output analysis: data_type: longitudinal timepoints_combinations: - timepoint: ["pre", "post"] focus: "post" groups_combinations: - ["treatment", "control"] ``` ## Step 2: Dry-run the workflow ```bash alleleflux run --config config_example.yml --dry-run ``` This builds the DAG and confirms inputs without running jobs. ## Step 3: Run the example end-to-end ```bash # modest local resources to keep the run quick alleleflux run --config config_example.yml --threads 4 --memory 8G ``` Outputs land under `example_output/longitudinal` (the pipeline appends `data_type`). ## Step 4: Inspect one MAG/timepoint/group The example uses the label `pre_post` for timepoints and `treatment_control` for groups. Here is a quick inspection for `TEST_MAG_001`: ```bash OUT=example_output/longitudinal # QC eligibility table that drives downstream targets column -t -s $'\t' $OUT/eligibility_table_pre_post-treatment_control.tsv | head # Allele analysis outputs (mean changes and zero-diff filtered) ls $OUT/allele_analysis/allele_analysis_pre_post-treatment_control \ | grep TEST_MAG_001 zcat $OUT/allele_analysis/allele_analysis_pre_post-treatment_control/TEST_MAG_001_allele_frequency_changes_mean.tsv.gz \ | head # Significance tests for this MAG/timepoint/group ls $OUT/significance_tests/two_sample_paired_pre_post-treatment_control \ | grep TEST_MAG_001 # Combined MAG-level scores column -t -s $'\t' \ $OUT/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv \ | head # dN/dS per subject (longitudinal only) ls $OUT/dnds_analysis/pre_post-treatment_control ``` ## Step 5: Micro-examples with mock data These commands use the tiny bundles in `tests/evolution/mock_data` so you can exercise individual tools without running the full workflow. ### `alleleflux-preprocess-between-groups` (paired filter on mock mean changes) Create a minimal mean-changes table from the bundled pre/post profiles for `MAG_001` and run the filter: ```bash python - <<'PY' import gzip, pandas as pd pre = pd.read_csv("tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles/pre_sample/pre_sample_MAG_001_profiled.tsv.gz", sep="\t") post = pd.read_csv("tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles/post_sample/post_sample_MAG_001_profiled.tsv.gz", sep="\t") # compute per-position mean change for each nucleotide (post - pre) normalized by coverage for df, label in [(pre, "pre"), (post, "post")]: for base in ["A", "T", "G", "C"]: df[f"{base}_frequency"] = df[base] / df["total_coverage"].clip(lower=1) df["timepoint"] = label merged = pre.merge( post, on=["contig", "position", "gene_id", "ref_base"], suffixes=("_pre", "_post"), ) out_cols = ["contig", "position", "gene_id"] for base in ["A", "T", "G", "C"]: merged[f"{base}_frequency"] = merged[f"{base}_frequency_post"] - merged[f"{base}_frequency_pre"] out_cols.append(f"{base}_frequency") merged[out_cols].to_csv("/tmp/mock_mean_changes.tsv", sep="\t", index=False) print("Wrote /tmp/mock_mean_changes.tsv with", len(merged), "rows") PY alleleflux-preprocess-between-groups \ --mean_changes_fPath /tmp/mock_mean_changes.tsv \ --output_fPath /tmp/mock_mean_changes_preprocessed.tsv \ --p_value_threshold 0.05 \ --data_type longitudinal \ --filter_type t-test \ --mag_id MAG_001 \ --min_positions 1 \ --min_sample_num 1 \ --status_dir /tmp ``` Outputs: `/tmp/mock_mean_changes_preprocessed.tsv` (filtered sites) and `/tmp/MAG_001_preprocessing_status.json` with eligibility counts. ### `alleleflux-dnds-from-timepoints` (mock significant sites → dN/dS) Use the mock significant sites and profiles shipped with the tests: ```bash alleleflux-dnds-from-timepoints \ --input tests/evolution/mock_data/mock_dnds_1/PipelineMock/significant_sites.tsv \ --output /tmp/mock_dnds \ --mag_id MAG_001 \ --profiles_dir tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles \ --prodigal_fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \ --fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \ --p_value_column q_value \ --p_value_threshold 0.05 \ --test_type two_sample_paired_tTest ``` Outputs (under `/tmp/mock_dnds`): codon, gene, MAG, and global NG86 summaries ready to inspect or plot.