Tutorial¶
This walkthrough uses the bundled example dataset in docs/source/examples/example_data plus the ready-made config_example.yml. It shows how to dry-run, execute, and inspect the outputs for a single MAG/timepoint/group. Micro-examples at the end exercise individual tools on the tiny mock datasets in tests/evolution/mock_data.
Prerequisites¶
AlleleFlux installed and on $PATH (see Installation).
Working from the repo root (AlleleFlux/) so relative paths resolve.
Step 1: Start from the template config¶
Option A: copy the example config (pre-populated for the bundled data):
cp docs/source/examples/example_data/config_example.yml ./config_example.yml
Option B: print the template then edit:
alleleflux init --template > config_example.yml
Open config_example.yml and confirm the paths point at docs/source/examples/example_data. The important bits (already set in the example):
input:
  fasta_path: docs/source/examples/example_data/reference/combined_mags.fasta
  prodigal_path: docs/source/examples/example_data/reference/prodigal_genes.fna
  metadata_path: docs/source/examples/example_data/metadata/sample_metadata.tsv
  gtdb_path: docs/source/examples/example_data/reference/gtdbtk_taxonomy.tsv
  mag_mapping_path: docs/source/examples/example_data/reference/mag_mapping.tsv
output:
  root_dir: ./example_output
analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: ["pre", "post"]
      focus: "post"
  groups_combinations:
    - ["treatment", "control"]
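Before running anything, it can help to confirm that the input paths in the config actually exist relative to the repo root. The following is a minimal sketch, not part of AlleleFlux: the helper name `missing_inputs` is hypothetical, the list of keys is simply the five `input` entries shown above, and the config is assumed to have already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`).

```python
from pathlib import Path

# The five keys under "input:" in the example config above.
INPUT_KEYS = (
    "fasta_path",
    "prodigal_path",
    "metadata_path",
    "gtdb_path",
    "mag_mapping_path",
)


def missing_inputs(config, base_dir="."):
    """Return a list of problems found in the config's input section.

    `config` is the parsed YAML as a plain dict; `base_dir` is the directory
    that relative paths are resolved against (the repo root in this tutorial).
    This helper is illustrative only and is not an AlleleFlux API.
    """
    inputs = config.get("input", {})
    problems = []
    for key in INPUT_KEYS:
        value = inputs.get(key)
        if value is None:
            problems.append(f"{key}: not set")
        elif not (Path(base_dir) / value).is_file():
            problems.append(f"{key}: {value} not found")
    return problems


# Example with two of the paths from the tutorial config; run from the repo root.
example_config = {
    "input": {
        "fasta_path": "docs/source/examples/example_data/reference/combined_mags.fasta",
        "metadata_path": "docs/source/examples/example_data/metadata/sample_metadata.tsv",
    }
}
for problem in missing_inputs(example_config):
    print(problem)
```

An empty result means every listed path resolved to a file. Whether AlleleFlux treats all five keys as mandatory is not stated here, so treat a "not set" message as a prompt to check rather than a definite error.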
Step 2: Dry-run the workflow¶
alleleflux run --config config_example.yml --dry-run
This builds the DAG and confirms inputs without running jobs.
Step 3: Run the example end-to-end¶
# modest local resources to keep the run quick
alleleflux run --config config_example.yml --threads 4 --memory 8G
Outputs land under example_output/longitudinal (the pipeline appends data_type).
Step 4: Inspect one MAG/timepoint/group¶
The example uses the label pre_post for timepoints and treatment_control for groups. Here is a quick inspection for TEST_MAG_001:
OUT=example_output/longitudinal

# QC eligibility table that drives downstream targets
column -t -s $'\t' $OUT/eligibility_table_pre_post-treatment_control.tsv | head

# Allele analysis outputs (mean changes and zero-diff filtered)
ls $OUT/allele_analysis/allele_analysis_pre_post-treatment_control \
  | grep TEST_MAG_001
zcat $OUT/allele_analysis/allele_analysis_pre_post-treatment_control/TEST_MAG_001_allele_frequency_changes_mean.tsv.gz \
  | head

# Significance tests for this MAG/timepoint/group
ls $OUT/significance_tests/two_sample_paired_pre_post-treatment_control \
  | grep TEST_MAG_001

# Combined MAG-level scores
column -t -s $'\t' \
  $OUT/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv \
  | head

# dN/dS per subject (longitudinal only)
ls $OUT/dnds_analysis/pre_post-treatment_control
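The paths in the commands above follow a visible pattern: timepoints joined with `_`, groups joined with `_`, and the two labels joined with `-`, all under `root_dir/data_type`. The sketch below reproduces that pattern so the same inspection can be scripted for other comparisons. It is inferred from the example paths only; the function names are hypothetical and the pipeline's own naming code may differ for other configurations.

```python
from pathlib import PurePosixPath


def comparison_label(timepoints, groups):
    """Build a label such as 'pre_post-treatment_control' (pattern inferred from the example)."""
    return f"{'_'.join(timepoints)}-{'_'.join(groups)}"


def expected_paths(root_dir, data_type, timepoints, groups, mag_id):
    """Return a few of the output locations inspected in Step 4."""
    out = PurePosixPath(root_dir) / data_type  # the pipeline appends data_type
    label = comparison_label(timepoints, groups)
    return {
        "eligibility": out / f"eligibility_table_{label}.tsv",
        "allele_changes": out
        / "allele_analysis"
        / f"allele_analysis_{label}"
        / f"{mag_id}_allele_frequency_changes_mean.tsv.gz",
        "dnds_dir": out / "dnds_analysis" / label,
    }


paths = expected_paths(
    "./example_output",
    "longitudinal",
    ["pre", "post"],
    ["treatment", "control"],
    "TEST_MAG_001",
)
for name, path in paths.items():
    print(f"{name}: {path}")
# eligibility: example_output/longitudinal/eligibility_table_pre_post-treatment_control.tsv
# allele_changes: example_output/longitudinal/allele_analysis/allele_analysis_pre_post-treatment_control/TEST_MAG_001_allele_frequency_changes_mean.tsv.gz
# dnds_dir: example_output/longitudinal/dnds_analysis/pre_post-treatment_control
```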
Step 5: Micro-examples with mock data¶
These commands use the tiny bundles in tests/evolution/mock_data so you can exercise individual tools without running the full workflow.
alleleflux-preprocess-between-groups (paired filter on mock mean changes)¶
Create a minimal mean-changes table from the bundled pre/post profiles for MAG_001 and run the filter:
python - <<'PY'
import pandas as pd

base_dir = "tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles"
pre = pd.read_csv(f"{base_dir}/pre_sample/pre_sample_MAG_001_profiled.tsv.gz", sep="\t")
post = pd.read_csv(f"{base_dir}/post_sample/post_sample_MAG_001_profiled.tsv.gz", sep="\t")

# Convert raw base counts to frequencies (count / coverage); clip avoids division by zero
for df, label in [(pre, "pre"), (post, "post")]:
    for base in ["A", "T", "G", "C"]:
        df[f"{base}_frequency"] = df[base] / df["total_coverage"].clip(lower=1)
    df["timepoint"] = label

# Pair pre and post rows that describe the same position
merged = pre.merge(
    post,
    on=["contig", "position", "gene_id", "ref_base"],
    suffixes=("_pre", "_post"),
)

# Per-position frequency change for each nucleotide (post - pre)
out_cols = ["contig", "position", "gene_id"]
for base in ["A", "T", "G", "C"]:
    merged[f"{base}_frequency"] = merged[f"{base}_frequency_post"] - merged[f"{base}_frequency_pre"]
    out_cols.append(f"{base}_frequency")

merged[out_cols].to_csv("/tmp/mock_mean_changes.tsv", sep="\t", index=False)
print("Wrote /tmp/mock_mean_changes.tsv with", len(merged), "rows")
PY
alleleflux-preprocess-between-groups \
  --mean_changes_fPath /tmp/mock_mean_changes.tsv \
  --output_fPath /tmp/mock_mean_changes_preprocessed.tsv \
  --p_value_threshold 0.05 \
  --data_type longitudinal \
  --filter_type t-test \
  --mag_id MAG_001 \
  --min_positions 1 \
  --min_sample_num 1 \
  --status_dir /tmp
Outputs: /tmp/mock_mean_changes_preprocessed.tsv (filtered sites) and /tmp/MAG_001_preprocessing_status.json with eligibility counts.
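To make the arithmetic in the table-building script concrete, here is a small worked example for a single position using plain Python instead of pandas. The counts are made up for illustration and do not come from the mock data; the function names are hypothetical. The `max(total_coverage, 1)` guard plays the same role as `.clip(lower=1)` in the script.

```python
BASES = ("A", "T", "G", "C")


def base_frequencies(counts, total_coverage):
    """Convert raw base counts at one position into frequencies."""
    denom = max(total_coverage, 1)  # same guard as .clip(lower=1): no division by zero
    return {base: counts[base] / denom for base in BASES}


def frequency_change(pre, post):
    """Per-base frequency change at one position, computed as post - pre."""
    return {base: post[base] - pre[base] for base in BASES}


# Illustrative counts: A is the major allele before, T afterwards (coverage 8 both times).
pre = base_frequencies({"A": 6, "T": 2, "G": 0, "C": 0}, total_coverage=8)
post = base_frequencies({"A": 2, "T": 6, "G": 0, "C": 0}, total_coverage=8)

change = frequency_change(pre, post)
print(change)  # {'A': -0.5, 'T': 0.5, 'G': 0.0, 'C': 0.0}

# A position with zero coverage yields all-zero frequencies rather than an error.
print(base_frequencies({"A": 0, "T": 0, "G": 0, "C": 0}, total_coverage=0))
```

Because the frequencies at a position sum to 1 when coverage equals the sum of the four counts, the four changes sum to 0 in that case, which is a handy sanity check on a generated table.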
alleleflux-dnds-from-timepoints (mock significant sites → dN/dS)¶
Use the mock significant sites and profiles shipped with the tests:
alleleflux-dnds-from-timepoints \
  --input tests/evolution/mock_data/mock_dnds_1/PipelineMock/significant_sites.tsv \
  --output /tmp/mock_dnds \
  --mag_id MAG_001 \
  --profiles_dir tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles \
  --prodigal_fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \
  --fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \
  --p_value_column q_value \
  --p_value_threshold 0.05 \
  --test_type two_sample_paired_tTest
Outputs (under /tmp/mock_dnds): codon, gene, MAG, and global NG86 summaries ready to inspect or plot.
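For readers unfamiliar with the NG86 label, it conventionally refers to the Nei and Gojobori (1986) method: the proportions of nonsynonymous and synonymous differences are each passed through a Jukes-Cantor correction and then divided. The sketch below shows that textbook calculation from already-counted sites and differences. It is an assumption that the summaries follow this form; the function names, the example counts, and the handling of edge cases (returning NaN) are illustrative and not taken from AlleleFlux.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion of differences p.

    Returns NaN when p >= 0.75, where the correction is undefined.
    """
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.nan
    return -0.75 * math.log(1 - 4 * p / 3)


def ng86_dnds(n_sites, s_sites, n_diffs, s_diffs):
    """Textbook NG86 dN/dS from site and difference counts.

    n_sites / s_sites: potential nonsynonymous / synonymous sites.
    n_diffs / s_diffs: observed nonsynonymous / synonymous differences.
    Returns NaN when the ratio is undefined (no sites, or dS of zero).
    """
    if n_sites <= 0 or s_sites <= 0:
        return math.nan
    dn = jukes_cantor(n_diffs / n_sites)
    ds = jukes_cantor(s_diffs / s_sites)
    if math.isnan(dn) or math.isnan(ds) or ds == 0:
        return math.nan
    return dn / ds


# Illustrative counts only: 3 of 300 nonsynonymous sites and 2 of 100 synonymous sites differ.
ratio = ng86_dnds(n_sites=300, s_sites=100, n_diffs=3, s_diffs=2)
print(f"dN/dS = {ratio:.3f}")
```

A ratio below 1 is usually read as purifying selection and above 1 as positive selection, though with few substitutions (as in tiny mock datasets) the value is noisy and often undefined.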