Tutorial

This walkthrough uses the bundled example dataset in docs/source/examples/example_data plus the ready-made config_example.yml. It shows how to dry-run, execute, and inspect the outputs for a single MAG/timepoint/group. Micro-examples at the end exercise individual tools on the tiny mock datasets in tests/evolution/mock_data.

Prerequisites

  • AlleleFlux installed and on $PATH (see Installation).

  • Working from the repo root (AlleleFlux/) so relative paths resolve.

Step 1: Start from the template config

Option A: copy the example config (pre-populated for the bundled data):

cp docs/source/examples/example_data/config_example.yml ./config_example.yml

Option B: print the template then edit:

alleleflux init --template > config_example.yml

Open config_example.yml and confirm the paths point at docs/source/examples/example_data. The important bits (already set in the example):

input:
  fasta_path: docs/source/examples/example_data/reference/combined_mags.fasta
  prodigal_path: docs/source/examples/example_data/reference/prodigal_genes.fna
  metadata_path: docs/source/examples/example_data/metadata/sample_metadata.tsv
  gtdb_path: docs/source/examples/example_data/reference/gtdbtk_taxonomy.tsv
  mag_mapping_path: docs/source/examples/example_data/reference/mag_mapping.tsv
output:
  root_dir: ./example_output
analysis:
  data_type: longitudinal
  timepoints_combinations:
    - timepoint: ["pre", "post"]
      focus: "post"
  groups_combinations:
    - ["treatment", "control"]

Step 2: Dry-run the workflow

alleleflux run --config config_example.yml --dry-run

This builds the DAG and confirms inputs without running jobs.

Step 3: Run the example end-to-end

# modest local resources to keep the run quick
alleleflux run --config config_example.yml --threads 4 --memory 8G

Outputs land under example_output/longitudinal (the pipeline appends data_type).

Step 4: Inspect one MAG/timepoint/group

The example uses the label pre_post for timepoints and treatment_control for groups. Here is a quick inspection for TEST_MAG_001:

OUT=example_output/longitudinal

# QC eligibility table that drives downstream targets
column -t -s $'\t' $OUT/eligibility_table_pre_post-treatment_control.tsv | head

# Allele analysis outputs (mean changes and zero-diff filtered)
ls $OUT/allele_analysis/allele_analysis_pre_post-treatment_control \
   | grep TEST_MAG_001
zcat $OUT/allele_analysis/allele_analysis_pre_post-treatment_control/TEST_MAG_001_allele_frequency_changes_mean.tsv.gz \
     | head

# Significance tests for this MAG/timepoint/group
ls $OUT/significance_tests/two_sample_paired_pre_post-treatment_control \
   | grep TEST_MAG_001

# Combined MAG-level scores
column -t -s $'\t' \
  $OUT/scores/processed/combined/MAG/scores_two_sample_paired-pre_post-treatment_control-MAGs.tsv \
  | head

# dN/dS per subject (longitudinal only)
ls $OUT/dnds_analysis/pre_post-treatment_control

Step 5: Micro-examples with mock data

These commands use the tiny bundles in tests/evolution/mock_data so you can exercise individual tools without running the full workflow.

alleleflux-preprocess-between-groups (paired filter on mock mean changes)

Create a minimal mean-changes table from the bundled pre/post profiles for MAG_001 and run the filter:

python - <<'PY'
import gzip, pandas as pd

pre = pd.read_csv("tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles/pre_sample/pre_sample_MAG_001_profiled.tsv.gz", sep="\t")
post = pd.read_csv("tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles/post_sample/post_sample_MAG_001_profiled.tsv.gz", sep="\t")

# compute per-position mean change for each nucleotide (post - pre) normalized by coverage
for df, label in [(pre, "pre"), (post, "post")]:
    for base in ["A", "T", "G", "C"]:
        df[f"{base}_frequency"] = df[base] / df["total_coverage"].clip(lower=1)
    df["timepoint"] = label

merged = pre.merge(
    post,
    on=["contig", "position", "gene_id", "ref_base"],
    suffixes=("_pre", "_post"),
)
out_cols = ["contig", "position", "gene_id"]
for base in ["A", "T", "G", "C"]:
    merged[f"{base}_frequency"] = merged[f"{base}_frequency_post"] - merged[f"{base}_frequency_pre"]
    out_cols.append(f"{base}_frequency")

merged[out_cols].to_csv("/tmp/mock_mean_changes.tsv", sep="\t", index=False)
print("Wrote /tmp/mock_mean_changes.tsv with", len(merged), "rows")
PY

alleleflux-preprocess-between-groups \
  --mean_changes_fPath /tmp/mock_mean_changes.tsv \
  --output_fPath /tmp/mock_mean_changes_preprocessed.tsv \
  --p_value_threshold 0.05 \
  --data_type longitudinal \
  --filter_type t-test \
  --mag_id MAG_001 \
  --min_positions 1 \
  --min_sample_num 1 \
  --status_dir /tmp

Outputs: /tmp/mock_mean_changes_preprocessed.tsv (filtered sites) and /tmp/MAG_001_preprocessing_status.json with eligibility counts.

alleleflux-dnds-from-timepoints (mock significant sites → dN/dS)

Use the mock significant sites and profiles shipped with the tests:

alleleflux-dnds-from-timepoints \
  --input tests/evolution/mock_data/mock_dnds_1/PipelineMock/significant_sites.tsv \
  --output /tmp/mock_dnds \
  --mag_id MAG_001 \
  --profiles_dir tests/evolution/mock_data/mock_dnds_1/PipelineMock/profiles \
  --prodigal_fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \
  --fasta tests/evolution/mock_data/mock_dnds_1/PipelineMock/prodigal_genes.fasta \
  --p_value_column q_value \
  --p_value_threshold 0.05 \
  --test_type two_sample_paired_tTest

Outputs (under /tmp/mock_dnds): codon, gene, MAG, and global NG86 summaries ready to inspect or plot.