dN/dS Analysis Guide¶
This guide covers dN/dS ratio analysis in AlleleFlux using the Nei-Gojobori (1986) method with path averaging.
Overview¶
The dN/dS ratio (ω) measures evolutionary selection by comparing:
dN: Rate of non-synonymous substitutions (change amino acid)
dS: Rate of synonymous substitutions (silent mutations)
Interpretation:
dN/dS |
Selection Mode |
|---|---|
< 1 |
Purifying selection – deleterious mutations removed |
≈ 1 |
Neutral evolution – no selective pressure |
> 1 |
Positive selection – advantageous mutations favored |
Methodology¶
Nei-Gojobori Method with Path Averaging¶
AlleleFlux implements the NG86 method to calculate dN/dS ratios from statistically significant sites identified by the workflow.
Key Features¶
Potential Sites Calculation
For each codon, determine how many possible single-nucleotide changes are synonymous vs. non-synonymous
Results in fractional S and N values (e.g., ATG: S=1.0, N=8.0)
Path Averaging for Multi-Position Changes
When multiple positions change in a codon (k=2 or k=3), enumerate all possible mutational pathways
Average S/NS classifications across paths
Exclude pathways through intermediate stop codons
Example: AAA→TTG has 6 pathways (3!), each potentially classifying mutations differently
Codon-Level Analysis
Pre-computed cache for all 64 codons (potential sites)
Pre-computed cache for all 4,096 codon pairs (path-averaged counts)
O(1) lookups during analysis for efficiency
Formula:
For each gene:
Total_S = Σ(potential S sites for each codon)
Total_N = Σ(potential N sites for each codon)
Observed_S = Σ(path-averaged S count for changed codons)
Observed_N = Σ(path-averaged NS count for changed codons)
pS = Observed_S / Total_S
pN = Observed_N / Total_N
dN/dS = pN / pS
Workflow Integration¶
dN/dS analysis follows statistical testing:
Statistical tests → p_value_summary.py → significant sites
↓
alleleflux-dnds-from-timepoints → dN/dS ratios
Required inputs:
Significant sites file from
alleleflux-p-value-summaryProfile files for ancestral/derived timepoints
Prodigal gene predictions
Reference FASTA
Usage¶
Basic Command¶
alleleflux-dnds-from-timepoints \
--input significant_sites_summary.tsv \
--output output_dir/ \
--mag_id Bacteroides_001 \
--profiles_dir profiles/ \
--prodigal_fasta genes.fna \
--fasta reference.fasta \
--ancestral_timepoint pre \
--derived_timepoint post
Key Parameters¶
Parameter |
Description |
|---|---|
|
Significant sites TSV from |
|
MAG identifier to analyze |
|
|
|
Significance cutoff (default: 0.05) |
|
Filter by test (e.g., |
|
Filter by group |
|
Earlier timepoint for sequence reconstruction |
|
Later timepoint for comparison |
Advanced Examples¶
Filter by test type and group:
alleleflux-dnds-from-timepoints \
--input p_value_summary.tsv \
--output dnds_results/ \
--mag_id Lachnospira_002 \
--test_type two_sample_paired_tTest \
--group_analyzed treatment \
--p_value_column q_value \
--p_value_threshold 0.01 \
--profiles_dir profiles/ \
--prodigal_fasta genes.fna \
--fasta ref.fasta \
--ancestral_timepoint day0 \
--derived_timepoint day30
Process multiple MAGs:
for mag in $(cat mag_list.txt); do
alleleflux-dnds-from-timepoints \
--input p_value_summary.tsv \
--output dnds_results/ \
--mag_id $mag \
--profiles_dir profiles/ \
--prodigal_fasta genes.fna \
--fasta ref.fasta \
--ancestral_timepoint pre \
--derived_timepoint post
done
Output Files¶
Codon Events File¶
Path: {output}/{mag}_codon_events_ng86.tsv.gz
Codon-level substitution details with path-averaged counts.
Column |
Type |
Description |
|---|---|---|
|
str |
Gene identifier |
|
int |
Position in gene (1-based) |
|
str |
Ancestral codon |
|
str |
Derived codon |
|
str |
Amino acids |
|
int |
Number of changed positions (1, 2, or 3) |
|
float |
Path-averaged synonymous count |
|
float |
Path-averaged non-synonymous count |
|
float |
Expected S/N sites for this codon |
Gene Summary File¶
Path: {output}/{mag}_gene_summary_ng86.tsv.gz
Gene-level dN/dS ratios.
Column |
Description |
|---|---|
|
Gene identifier |
|
dN/dS ratio |
|
Proportional synonymous/non-synonymous rates |
|
Total path-averaged counts |
|
Total expected S/N sites |
|
Count of changed codons |
|
Count of significant positions in gene |
MAG Summary File¶
Path: {output}/{mag}_mag_summary_ng86.tsv.gz
MAG-level dN/dS summary.
Column |
Description |
|---|---|
|
MAG identifier |
|
Overall MAG dN/dS |
|
Total genes analyzed |
|
Total changed codons across MAG |
Global Summary File¶
Path: {output}/{mag}_global_summary_ng86.tsv
Single-row summary of entire analysis.
Contains aggregate dN_dS, total sites analyzed, and overall statistics.
Interpreting Results¶
Understanding dN/dS Values¶
dN/dS Range |
Interpretation |
Example Genes |
|---|---|---|
0 - 0.3 |
Strong purifying selection |
Essential housekeeping genes |
0.3 - 0.7 |
Moderate constraint |
Metabolic enzymes |
0.7 - 1.3 |
Neutral/weak selection |
Structural proteins |
1.3 - 3.0 |
Positive selection |
Antibiotic resistance genes |
> 3.0 |
Strong positive selection |
Host-pathogen interaction genes |
NaN values: Occur when no synonymous substitutions observed (pS = 0). Common in short genes or low-divergence comparisons.
Understanding k-Values¶
The k value indicates how many positions changed in a codon:
k=1: Single-position change (e.g., ATG→ATC) – straightforward S/NS classification
k=2: Two positions changed (e.g., ATG→ACG) – 2! = 2 pathways averaged
k=3: Three positions changed (e.g., ATG→CAA) – 3! = 6 pathways averaged
Higher k-values may indicate:
Longer time between samples
Stronger selection pressure
Less reliable dN/dS estimates (fewer intermediate observations)
Biological Interpretation¶
High dN/dS (>1) scenarios:
Antibiotic exposure: Resistance gene mutations selected
Nutrient shifts: Metabolic adaptation to new resources
Immune evasion: Surface protein diversification
Novel environment: Rapid adaptation to colonization
Low dN/dS (<0.5) scenarios:
Core metabolism: Essential enzymes under constraint
Structural genes: Protein folding requirements
Information processing: Ribosomal proteins, polymerases
Common Use Cases¶
Use Case 1: Genes Under Selection¶
Goal: Identify genes with dN/dS > 1 indicating positive selection.
Workflow:
# Run dN/dS analysis
alleleflux-dnds-from-timepoints --input p_val_summary.tsv --output dnds/ --mag_id MAG001 [...]
# Filter gene summary for high dN/dS
awk '$3 > 1' dnds/MAG001_gene_summary_ng86.tsv.gz | gzip -cd | sort -k3,3nr > genes_under_selection.tsv
Interpretation: Genes with dN/dS > 1 are candidates for adaptive evolution.
Use Case 2: Comparing Selection Across MAGs¶
Goal: Compare selection patterns between microbial taxa.
Workflow:
# Process multiple MAGs
for mag in MAG001 MAG002 MAG003; do
alleleflux-dnds-from-timepoints --input p_val_summary.tsv --mag_id $mag --output dnds/ [...]
done
# Extract MAG-level dN/dS
for mag in dnds/*_mag_summary_ng86.tsv.gz; do
echo "$(basename $mag)"
gzip -cd $mag | awk 'NR==2 {print $3}'
done
Interpretation: MAGs with higher dN/dS may be under stronger selective pressure.
Use Case 3: Pathway-Level Analysis¶
Goal: Determine if functional pathways show coordinated selection.
Workflow:
# Annotate genes (e.g., with KEGG)
# Join gene summary with annotations
# Summarize by pathway
awk '{pathway=$5; dnds=$3; sum[pathway]+=dnds; count[pathway]++}
END {for (p in sum) print p, sum[p]/count[p]}' annotated_genes.tsv
Interpretation: Pathways with elevated mean dN/dS suggest functional adaptation.
Limitations and Considerations¶
Low Coverage Regions¶
Insufficient depth → unreliable allele frequency estimates
Apply QC filters (
breadth_thresholdin config)
Multiple Substitutions (k>1)¶
Path averaging assumes all pathways equally likely
High k-values may reduce accuracy
Consider time between sampling points
Stop Codons¶
Intermediate stop codons in pathways are excluded (biologically unrealistic)
Final stop codons allowed (represent valid termination)
Sample Size¶
Small sample sizes → less statistical power
Ensure sufficient replicates for significance testing upstream
Recombination¶
dN/dS assumes vertical inheritance
Horizontal gene transfer can inflate dN/dS estimates
Consider phylogenetic context
Time Between Samples¶
Short intervals: fewer substitutions, many NaN values
Long intervals: multiple substitutions (high k), path uncertainty
Optimal: intermediate divergence (1-5% nucleotide difference)
Troubleshooting¶
- Error: “Required columns not found”
Check significant sites file has columns:
mag_id,contig,position,gene_id,q_valueormin_p_value- Warning: “Reference base mismatch”
Profile data reference doesn’t match FASTA. Verify consistent reference used throughout workflow.
- Many NaN dN/dS values
No synonymous substitutions observed (pS=0) – common in short genes or low divergence
Solution: Use longer time intervals or filter for genes with k≥2
- Script runs slowly
Large number of significant sites → long runtime
Solution: Use stricter p-value threshold or parallelize across MAGs
Validation¶
Sanity checks:
# Check potential sites make sense (should sum to ~3× gene length)
gzip -cd *_potential_sites_ng86.tsv.gz | awk '{s+=$2; n+=$3} END {print "S:", s, "N:", n, "Ratio:", n/s}'
# Expected N/S ratio ~2.5-3 for typical bacterial genes
# Check codon events have fractional counts summing correctly
gzip -cd *_codon_events_ng86.tsv.gz | awk '$9 != "" {print $9, $10}' | head
Further Reading¶
Primary References:
Nei & Gojobori (1986). Mol Biol Evol 3:418-426. DOI:10.1093/oxfordjournals.molbev.a040410
Yang & Nielsen (2000). Mol Biol Evol 17:32-43. (Review of dN/dS methods)
Related AlleleFlux Documentation:
Interpreting Results – Allele frequency interpretation
Output Files Reference – Output file specifications
CLI Reference – Command-line options
External Tools: