Visualization Guide

AlleleFlux provides a complete visualization workflow for exploring allele frequency dynamics across samples and timepoints. This guide walks through the four-step visualization pipeline, from identifying significant sites to generating publication-ready plots.

Note

The visualization workflow requires completing the main AlleleFlux pipeline first to generate significant sites. Alternatively, you can use the provided example data to learn the workflow.


Visualization Workflow Overview

The visualization pipeline consists of four sequential steps:

1. Prepare Metadata      →  Standardize sample metadata with profile paths
2. Terminal Nucleotide   →  Identify terminal (endpoint) alleles at significant sites
3. Track Alleles         →  Build frequency tables for anchor alleles across samples
4. Plot Trajectories     →  Generate line plots, box plots, and violin plots

Each step produces output files that serve as input to the next step.


Step 1: Prepare Metadata

Command: alleleflux-prepare-metadata

This step standardizes your metadata and adds profile file paths, creating a unified metadata file for the visualization workflow.

Input Requirements

Original Metadata File (TSV):

Column

Required

Description

sample_id

Yes

Unique sample identifier

group

Yes

Experimental group (e.g., “treatment”, “control”)

time

Yes

Timepoint identifier (e.g., “pre”, “post”, “day1”)

subjectID

Yes

Biological replicate/subject ID for pairing samples

day

No

Numeric day for continuous time axis

replicate

No

Replicate identifier within group

Profile Directory Structure:

profiles/
├── sample1/
│   ├── sample1_MAG_001_profiled.tsv.gz
│   └── sample1_MAG_002_profiled.tsv.gz
├── sample2/
│   └── ...

Usage

alleleflux-prepare-metadata \
    --metadata-in original_metadata.tsv \
    --metadata-out visualization_metadata.tsv \
    --base-profile-dir /path/to/profiles \
    --sample-col sample_id \
    --group-col group \
    --time-col timepoint

Key Arguments:

Argument

Default

Description

--metadata-in

Required

Input metadata file

--metadata-out

Required

Output standardized metadata (appends if exists)

--base-profile-dir

Required

Base directory containing sample profile subdirectories

--sample-col

sample_id

Column name for sample IDs in input

--group-col

group

Column name for experimental groups

--time-col

time

Column name for timepoints

Output

Standardized Metadata (visualization_metadata.tsv):

sample_id    group       time    subjectID    file_path
sample1      treatment   pre     subj1        /path/to/profiles/sample1
sample2      treatment   post    subj1        /path/to/profiles/sample2
sample3      control     pre     subj2        /path/to/profiles/sample3
sample4      control     post    subj2        /path/to/profiles/sample4

Step 2: Terminal Nucleotide Analysis

Command: alleleflux-terminal-nucleotide

This step identifies the “terminal” (endpoint) nucleotide at each significant genomic site. Two methods are used:

  1. Mean Frequency Method: Calculates mean allele frequencies and selects the most frequent allele

  2. Majority Vote Method: Each sample votes for its dominant allele; the most-voted allele wins

Input Requirements

Significant Sites File (TSV):

Column

Required

Description

mag_id

Yes

MAG identifier

contig

Yes

Contig identifier

position

Yes

0-based genomic position

gene_id

Yes

Gene identifier

min_p_value

Yes*

Minimum p-value (used for filtering)

q_value

Yes*

FDR-adjusted p-value (used for filtering)

At least one of min_p_value or q_value required, depending on --p_value_column setting.

Example significant_sites.tsv:

mag_id      contig                  position    gene_id                     test_type               min_p_value    q_value
MAG_001     MAG_001.fa_contig1      120         MAG_001.fa_contig1_gene1    two_sample_paired_tTest 1.5e-06        2.3e-05
MAG_001     MAG_001.fa_contig1      145         MAG_001.fa_contig1_gene1    two_sample_paired_tTest 3.2e-05        1.8e-04

Usage

alleleflux-terminal-nucleotide \
    --significant_sites p_value_summary.tsv \
    --profile_dir profiles/ \
    --metadata visualization_metadata.tsv \
    --group treatment \
    --timepoint post \
    --output results/terminal/ \
    --p_value_column q_value \
    --p_value_threshold 0.05 \
    --cpus 16

Key Arguments:

Argument

Default

Description

--significant_sites

Required

Path to significant sites table

--profile_dir

Required

Directory containing sample profile subdirectories

--metadata

Required

Standardized metadata file from Step 1

--group

Required

Target group for terminal nucleotide calculation

--timepoint

Required

Target timepoint (typically endpoint)

--output

Required

Output directory

--p_value_column

q_value

Column for filtering: min_p_value or q_value

--p_value_threshold

0.05

Maximum p-value to include site

Output Files

Per-MAG Terminal Nucleotides ({output}/{mag_id}/{mag_id}_terminal_nucleotides.tsv):

Column

Description

contig

Contig identifier

position

Genomic position

gene_id

Gene identifier

terminal_nucleotide_mean_freq

Terminal allele (mean frequency method)

terminal_nucleotide_majority_vote

Terminal allele (majority vote method)

min_p_value, q_value

Original significance values

Full Frequency Data ({mag_id}_frequencies.tsv): Complete allele frequencies for all samples at each site.

Summary File (terminal_nucleotide_analysis_summary.tsv): Aggregated results across all MAGs.


Step 3: Track Alleles

Command: alleleflux-track-alleles

This step creates frequency tables that track the “anchor” allele (terminal nucleotide) across all samples and timepoints. The output is formatted for direct use in plotting.

Input Requirements

  • Anchor File: Output from Step 2 (terminal nucleotides)

  • Enhanced Metadata: Metadata with file_path column from Step 1

Usage

alleleflux-track-alleles \
    --mag-id MAG_001 \
    --anchor-file results/terminal/MAG_001/MAG_001_terminal_nucleotides.tsv \
    --metadata visualization_metadata.tsv \
    --output-dir results/tracking/ \
    --anchor-column terminal_nucleotide_mean_freq \
    --min-cov-per-site 5 \
    --cpus 16

Key Arguments:

Argument

Default

Description

--mag-id

Required

MAG identifier to process

--anchor-file

Required

Terminal nucleotides file from Step 2

--metadata

Required

Enhanced metadata with file_path column

--output-dir

Required

Output directory

--anchor-column

terminal_nucleotide_mean_freq

Which anchor method to use

--min-cov-per-site

0

Minimum coverage required (sites below excluded)

Output Files

Wide-Format Frequency Table ({mag_id}_frequency_table.wide.tsv):

Sites as rows, samples as columns:

contig    position    gene_id    anchor_allele    sample1_pre    sample1_post    sample2_pre    sample2_post
contig1   120         gene1      A                0.92           0.45            0.88           0.42
contig1   145         gene1      G                0.85           0.72            0.91           0.68

Long-Format Frequency Table ({mag_id}_frequency_table.long.tsv):

Tidy format for R/ggplot2:

contig    position    gene_id    anchor_allele    sample_id    frequency    group       time    subjectID
contig1   120         gene1      A                sample1      0.92         treatment   pre     subj1
contig1   120         gene1      A                sample1      0.45         treatment   post    subj1
contig1   120         gene1      A                sample2      0.88         control     pre     subj2

Step 4: Plot Allele Trajectories

Command: alleleflux-plot-trajectories

This step generates visualization plots showing allele frequency trajectories. Multiple plot types are available:

  • Line Plots: Track individual site trajectories over time

  • Box Plots: Show distribution of frequencies at each timepoint

  • Violin Plots: Show frequency distributions with density

Input Requirements

Long-Format Frequency Table (from Step 3):

Required columns:

Column

Description

contig, position, anchor_allele

Site identification

frequency

Allele frequency value (0-1)

group

Experimental group

subjectID

Subject/replicate identifier

time or day

Temporal information for x-axis

Usage

alleleflux-plot-trajectories \
    --input_file results/tracking/MAG_001_frequency_table.long.tsv \
    --value_col q_value \
    --n_sites_line 20 \
    --n_sites_dist all \
    --x_col time \
    --x_order pre post \
    --plot_types line box violin \
    --per_site \
    --n_sites_per_site 10 \
    --output_dir plots/MAG_001 \
    --output_format pdf \
    --group_by_replicate

Key Arguments:

Argument

Default

Description

--input_file

Required

Long-format frequency table from Step 3

--value_col

min_p_value

Column for ranking sites: min_p_value, q_value

--n_sites_line

10

Number of top sites for line plots (or “all”)

--n_sites_dist

all

Number of sites for box/violin plots

--x_col

time

X-axis column: time or day

--x_order

None

Custom x-axis order (space-separated)

--plot_types

line

Plot types: line, box, violin

--per_site

False

Generate individual plots per site

--n_sites_per_site

None

Number of sites for per-site plots

--output_dir

./plots

Output directory

--output_format

png

Format: png, pdf, svg

--group_by_replicate

False

Aggregate trajectories by replicate

Advanced Options

Day Binning (for continuous time data):

--x_col day --bin_width 7 --min_samples_per_bin 3

Custom Styling:

--line_alpha 0.6 --output_format svg

Output Files

Combined Plots:

  • {mag_id}_line_plot.{format} - Line trajectories for top N sites

  • {mag_id}_box_plot.{format} - Box plots across timepoints

  • {mag_id}_violin_plot.{format} - Violin plots with density

Per-Site Plots (when --per_site enabled):

  • per_site/{contig}_{position}_{gene}_line.{format}

  • per_site/{contig}_{position}_{gene}_by_replicate.{format}


Example: Complete Visualization Workflow

Using the provided example data:

# Navigate to example data directory
cd docs/source/examples/example_data

# Step 1: Prepare metadata (already done - metadata provided)
# alleleflux-prepare-metadata ...

# Step 2: Terminal nucleotide analysis
alleleflux-terminal-nucleotide \
    --significant_sites significant_sites/significant_sites.tsv \
    --profile_dir profiles/ \
    --metadata metadata/sample_metadata.tsv \
    --group treatment \
    --timepoint post \
    --output results/terminal/ \
    --p_value_column q_value \
    --p_value_threshold 0.05

# Step 3: Track alleles
alleleflux-track-alleles \
    --mag-id TEST_MAG_001 \
    --anchor-file results/terminal/TEST_MAG_001/TEST_MAG_001_terminal_nucleotides.tsv \
    --metadata metadata/sample_metadata.tsv \
    --output-dir results/tracking/

# Step 4: Generate plots
alleleflux-plot-trajectories \
    --input_file results/tracking/TEST_MAG_001_frequency_table.long.tsv \
    --n_sites_line 10 \
    --plot_types line box \
    --x_order pre post \
    --output_dir plots/TEST_MAG_001 \
    --output_format png

Expected Plot Outputs

After running the visualization pipeline, you should see plots like:

Line Plot (MAG_001_line_plot.png):

Shows allele frequency trajectories for top significant sites. Each line represents a single genomic site, with x-axis showing timepoints and y-axis showing anchor allele frequency (0-1). Different colors may represent different genes or groups.

Note

To generate example plots:

Run the complete workflow above with the example data, then find plots in the plots/ directory. For publication-quality figures, use --output_format pdf or --output_format svg.

Box Plot (MAG_001_box_plot.png):

Distribution of allele frequencies at each timepoint across all significant sites. Useful for seeing overall trends in allele frequency shifts.

Violin Plot (MAG_001_violin_plot.png):

Similar to box plots but shows the full distribution density, helpful for identifying bimodal patterns or skewed distributions.


Generating Screenshots

To generate the example plots shown in this documentation:

  1. Run the example workflow above

  2. Screenshots will be in plots/TEST_MAG_001/

For generating documentation images:

# Generate high-resolution PNG for documentation
alleleflux-plot-trajectories \
    --input_file results/tracking/TEST_MAG_001_frequency_table.long.tsv \
    --plot_types line box violin \
    --x_order pre post \
    --output_dir docs/source/_static/images/visualization_examples \
    --output_format png

Troubleshooting

No sites passing p-value threshold:

  • Lower --p_value_threshold (e.g., 0.1)

  • Check if --p_value_column matches your data (q_value vs min_p_value)

Empty frequency tables:

  • Ensure --profile_dir structure matches expected format

  • Verify --metadata has correct file_path column

  • Check --min-cov-per-site isn’t too stringent

Missing timepoints in plots:

  • Verify --x_order matches values in your time column exactly

  • Check metadata has samples at all specified timepoints

Memory errors with large datasets:

  • Process one MAG at a time

  • Reduce --cpus to limit parallel processing


See Also