AlleleFlux Example Data
This directory contains minimal synthetic data for testing and learning AlleleFlux.
Contents
example_data/
├── README.md # This file
├── reference/
│ ├── combined_mags.fasta # Reference FASTA with 2 MAGs
│ ├── prodigal_genes.fna # Gene predictions
│ ├── mag_mapping.tsv # Contig-to-MAG mapping
│ └── gtdbtk_taxonomy.tsv # Mock GTDB taxonomy
├── metadata/
│ └── sample_metadata.tsv # Sample metadata (8 samples)
├── profiles/ # Pre-generated profile files
│ ├── control_subj1_pre/
│ │ ├── control_subj1_pre_TEST_MAG_001_profiled.tsv.gz
│ │ └── control_subj1_pre_TEST_MAG_002_profiled.tsv.gz
│ └── ... (8 sample directories)
├── significant_sites/
│ └── significant_sites.tsv # Example significant sites for visualization
└── config_example.yml # Working configuration file
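The profile layout in the tree above follows a regular pattern: one directory per sample, containing one file per MAG. A minimal sketch of building such a path is shown here; the helper name `profile_path` is hypothetical and not part of AlleleFlux, and the pattern is inferred only from the two file names listed in the tree.

```python
from pathlib import Path


def profile_path(profile_dir, sample, mag):
    """Build the expected path of one profile file.

    Assumes the layout <profile_dir>/<sample>/<sample>_<mag>_profiled.tsv.gz,
    as inferred from the directory tree in this README.
    """
    return Path(profile_dir) / sample / f"{sample}_{mag}_profiled.tsv.gz"


path = profile_path(
    "docs/source/examples/example_data/profiles",
    "control_subj1_pre",
    "TEST_MAG_001",
)
print(path.as_posix())
```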
Dataset Overview
2 MAGs: TEST_MAG_001 and TEST_MAG_002
8 samples: 4 control, 4 treatment (2 timepoints each: pre/post)
4 subjects: Each with samples at both timepoints (longitudinal design)
~2000 positions per MAG: Sufficient to demonstrate allele frequency analysis
Usage
Running the Full Pipeline
cd /path/to/AlleleFlux
alleleflux run --config docs/source/examples/example_data/config_example.yml
Running Individual Steps
# Profile a sample (already done - profiles provided)
alleleflux-profile \
--bam_path /path/to/your.bam \
--fasta_path docs/source/examples/example_data/reference/combined_mags.fasta \
--prodigal_fasta docs/source/examples/example_data/reference/prodigal_genes.fna \
--mag_mapping_file docs/source/examples/example_data/reference/mag_mapping.tsv \
--output_dir output/profiles
# Run visualization workflow
alleleflux-terminal-nuc-analysis \
--significant_sites docs/source/examples/example_data/significant_sites/significant_sites.tsv \
--profile_dir docs/source/examples/example_data/profiles \
--metadata docs/source/examples/example_data/metadata/sample_metadata.tsv \
--group treatment \
--timepoint post \
--output results/terminal
Generating Larger Datasets
For testing with larger datasets, use the provided generation script:
python docs/source/examples/generate_synthetic_data.py \
--num_mags 10 \
--num_samples 20 \
--num_positions 5000 \
--output_dir my_test_data
See generate_synthetic_data.py --help for all options.
File Format Specifications
Profile Files (*_profiled.tsv.gz)
| Column | Type | Description |
|---|---|---|
| | string | Contig identifier |
| | int | 0-based genomic position |
| | string | Overlapping gene ID |
| | string | Reference base (A/C/G/T) |
| | int | Count of adenine bases |
| | int | Count of thymine bases |
| | int | Count of guanine bases |
| | int | Count of cytosine bases |
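The four base counts at a position are enough to derive per-base allele frequencies. The sketch below illustrates that arithmetic only; it is not AlleleFlux's implementation. The function name `allele_frequencies` and the keys `A`, `T`, `G`, `C` are placeholders, so check the header of an actual profile file for the real column names.

```python
def allele_frequencies(a, t, g, c):
    """Convert raw base counts at one position into frequencies.

    Returns None when the position has no coverage, so callers can
    skip it instead of dividing by zero.
    """
    total = a + t + g + c
    if total == 0:
        return None
    return {"A": a / total, "T": t / total, "G": g / total, "C": c / total}


# Hypothetical counts for a single position with 8 reads of coverage.
print(allele_frequencies(4, 2, 1, 1))
```

Returning `None` for zero-coverage positions keeps the missing-data case explicit rather than reporting a frequency of 0 for every base.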
Metadata File (sample_metadata.tsv)
| Column | Description |
|---|---|
| | Unique sample identifier |
| | Path to BAM file (or profile directory) |
| | Biological replicate/subject ID |
| | Experimental group (control/treatment) |
| | Replicate letter within group |
| | Timepoint (pre/post) |
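Because the design is longitudinal, a useful sanity check on a metadata file is that every subject has a sample at each timepoint. The following is a minimal sketch of that check, assuming rows have already been read into dictionaries; the keys `subject` and `timepoint` and the function name are hypothetical and should be replaced with the column names in your metadata file.

```python
from collections import defaultdict


def subjects_missing_timepoints(rows, required=("pre", "post")):
    """Return subjects that lack at least one of the required timepoints."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["subject"]].add(row["timepoint"])
    needed = set(required)
    return sorted(s for s, timepoints in seen.items() if not needed <= timepoints)


# Hypothetical rows: subj2 has no "post" sample.
rows = [
    {"subject": "subj1", "timepoint": "pre"},
    {"subject": "subj1", "timepoint": "post"},
    {"subject": "subj2", "timepoint": "pre"},
]
print(subjects_missing_timepoints(rows))
```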
Significant Sites (significant_sites.tsv)
| Column | Description |
|---|---|
| | MAG identifier |
| | Contig identifier |
| | 0-based position |
| | Gene identifier |
| | Statistical test used |
| | Minimum p-value across tests |
| | FDR-adjusted p-value |
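A common first step with a table like this is to keep only the sites whose FDR-adjusted p-value falls below a threshold. The sketch below assumes rows are dictionaries and that the adjusted value sits under a key named `q_value`; that key, the function name, and the default threshold of 0.05 are all illustrative assumptions rather than AlleleFlux conventions.

```python
def filter_sites(rows, alpha=0.05, key="q_value"):
    """Keep rows whose adjusted p-value is at or below alpha.

    Values are passed through float() so the function also works on
    strings read straight from a TSV file.
    """
    return [row for row in rows if float(row[key]) <= alpha]


# Hypothetical rows; positions are 0-based, as in the table above.
sites = [
    {"position": 10, "q_value": "0.01"},
    {"position": 42, "q_value": "0.20"},
]
print(filter_sites(sites))
```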