dN/dS Analysis Guide¶

This guide covers dN/dS ratio analysis in AlleleFlux using the Nei-Gojobori (1986) method with path averaging.

Overview¶

The dN/dS ratio (ω) measures evolutionary selection by comparing:

dN: Rate of non-synonymous substitutions (change amino acid)
dS: Rate of synonymous substitutions (silent mutations)

Interpretation:

dN/dS	Selection Mode
< 1	Purifying selection – deleterious mutations removed
≈ 1	Neutral evolution – no selective pressure
> 1	Positive selection – advantageous mutations favored

Methodology¶

Nei-Gojobori Method with Path Averaging¶

AlleleFlux implements the NG86 method to calculate dN/dS ratios from statistically significant sites identified by the workflow.

Key Features¶

Potential Sites Calculation
- For each codon, determine how many possible single-nucleotide changes are synonymous vs. non-synonymous
- Results in fractional S and N values (e.g., ATG: S=1.0, N=8.0)
Path Averaging for Multi-Position Changes
- When multiple positions change in a codon (k=2 or k=3), enumerate all possible mutational pathways
- Average S/NS classifications across paths
- Exclude pathways through intermediate stop codons
- Example: AAA→TTG has 6 pathways (3!), each potentially classifying mutations differently
Codon-Level Analysis
- Pre-computed cache for all 64 codons (potential sites)
- Pre-computed cache for all 4,096 codon pairs (path-averaged counts)
- O(1) lookups during analysis for efficiency

Formula:

For each gene:
  Total_S = Σ(potential S sites for each codon)
  Total_N = Σ(potential N sites for each codon)

  Observed_S = Σ(path-averaged S count for changed codons)
  Observed_N = Σ(path-averaged NS count for changed codons)

  pS = Observed_S / Total_S
  pN = Observed_N / Total_N

  dN/dS = pN / pS

Workflow Integration¶

dN/dS analysis follows statistical testing:

Statistical tests → p_value_summary.py → significant sites
                                                   ↓
                  alleleflux-dnds-from-timepoints → dN/dS ratios

Required inputs:

Significant sites file from alleleflux-p-value-summary
Profile files for ancestral/derived timepoints
Prodigal gene predictions
Reference FASTA

Usage¶

Basic Command¶

alleleflux-dnds-from-timepoints \
    --input significant_sites_summary.tsv \
    --output output_dir/ \
    --mag_id Bacteroides_001 \
    --profiles_dir profiles/ \
    --prodigal_fasta genes.fna \
    --fasta reference.fasta \
    --ancestral_timepoint pre \
    --derived_timepoint post

Key Parameters¶

Parameter	Description
`--input`	Significant sites TSV from `p_value_summary`
`--mag_id`	MAG identifier to analyze
`--p_value_column`	`q_value` (FDR-corrected) or `min_p_value` (default: `q_value`)
`--p_value_threshold`	Significance cutoff (default: 0.05)
`--test_type`	Filter by test (e.g., `two_sample_paired_tTest`)
`--group_analyzed`	Filter by group
`--ancestral_timepoint`	Earlier timepoint for sequence reconstruction
`--derived_timepoint`	Later timepoint for comparison

Advanced Examples¶

Filter by test type and group:

alleleflux-dnds-from-timepoints \
    --input p_value_summary.tsv \
    --output dnds_results/ \
    --mag_id Lachnospira_002 \
    --test_type two_sample_paired_tTest \
    --group_analyzed treatment \
    --p_value_column q_value \
    --p_value_threshold 0.01 \
    --profiles_dir profiles/ \
    --prodigal_fasta genes.fna \
    --fasta ref.fasta \
    --ancestral_timepoint day0 \
    --derived_timepoint day30

Process multiple MAGs:

for mag in $(cat mag_list.txt); do
    alleleflux-dnds-from-timepoints \
        --input p_value_summary.tsv \
        --output dnds_results/ \
        --mag_id $mag \
        --profiles_dir profiles/ \
        --prodigal_fasta genes.fna \
        --fasta ref.fasta \
        --ancestral_timepoint pre \
        --derived_timepoint post
done

Output Files¶

Codon Events File¶

Path: {output}/{mag}_codon_events_ng86.tsv.gz

Codon-level substitution details with path-averaged counts.

Column	Type	Description
`gene_id`	str	Gene identifier
`codon_number`	int	Position in gene (1-based)
`codon_before`	str	Ancestral codon
`codon_after`	str	Derived codon
`aa_before`, `aa_after`	str	Amino acids
`k`	int	Number of changed positions (1, 2, or 3)
`fractional_S`	float	Path-averaged synonymous count
`fractional_NS`	float	Path-averaged non-synonymous count
`potential_S`, `potential_N`	float	Expected S/N sites for this codon

Gene Summary File¶

Path: {output}/{mag}_gene_summary_ng86.tsv.gz

Gene-level dN/dS ratios.

Column	Description
`gene_id`	Gene identifier
`dN_dS`	dN/dS ratio
`pS`, `pN`	Proportional synonymous/non-synonymous rates
`observed_S`, `observed_N`	Total path-averaged counts
`potential_S`, `potential_N`	Total expected S/N sites
`num_codons_changed`	Count of changed codons
`num_significant_sites`	Count of significant positions in gene

MAG Summary File¶

Path: {output}/{mag}_mag_summary_ng86.tsv.gz

MAG-level dN/dS summary.

Column	Description
`mag_id`	MAG identifier
`dN_dS`	Overall MAG dN/dS
`num_genes`	Total genes analyzed
`num_codons_changed`	Total changed codons across MAG

Global Summary File¶

Path: {output}/{mag}_global_summary_ng86.tsv

Single-row summary of entire analysis.

Contains aggregate dN_dS, total sites analyzed, and overall statistics.

Interpreting Results¶

Understanding dN/dS Values¶

dN/dS Range	Interpretation	Example Genes
0 - 0.3	Strong purifying selection	Essential housekeeping genes
0.3 - 0.7	Moderate constraint	Metabolic enzymes
0.7 - 1.3	Neutral/weak selection	Structural proteins
1.3 - 3.0	Positive selection	Antibiotic resistance genes
> 3.0	Strong positive selection	Host-pathogen interaction genes

NaN values: Occur when no synonymous substitutions observed (pS = 0). Common in short genes or low-divergence comparisons.

Understanding k-Values¶

The k value indicates how many positions changed in a codon:

k=1: Single-position change (e.g., ATG→ATC) – straightforward S/NS classification
k=2: Two positions changed (e.g., ATG→ACG) – 2! = 2 pathways averaged
k=3: Three positions changed (e.g., ATG→CAA) – 3! = 6 pathways averaged

Higher k-values may indicate:

Longer time between samples
Stronger selection pressure
Less reliable dN/dS estimates (fewer intermediate observations)

Biological Interpretation¶

High dN/dS (>1) scenarios:

Antibiotic exposure: Resistance gene mutations selected
Nutrient shifts: Metabolic adaptation to new resources
Immune evasion: Surface protein diversification
Novel environment: Rapid adaptation to colonization

Low dN/dS (<0.5) scenarios:

Core metabolism: Essential enzymes under constraint
Structural genes: Protein folding requirements
Information processing: Ribosomal proteins, polymerases

Common Use Cases¶

Use Case 1: Genes Under Selection¶

Goal: Identify genes with dN/dS > 1 indicating positive selection.

Workflow:

# Run dN/dS analysis
alleleflux-dnds-from-timepoints --input p_val_summary.tsv --output dnds/ --mag_id MAG001 [...]

# Filter gene summary for high dN/dS
awk '$3 > 1' dnds/MAG001_gene_summary_ng86.tsv.gz | gzip -cd | sort -k3,3nr > genes_under_selection.tsv

Interpretation: Genes with dN/dS > 1 are candidates for adaptive evolution.

Use Case 2: Comparing Selection Across MAGs¶

Goal: Compare selection patterns between microbial taxa.

Workflow:

# Process multiple MAGs
for mag in MAG001 MAG002 MAG003; do
    alleleflux-dnds-from-timepoints --input p_val_summary.tsv --mag_id $mag --output dnds/ [...]
done

# Extract MAG-level dN/dS
for mag in dnds/*_mag_summary_ng86.tsv.gz; do
    echo "$(basename $mag)"
    gzip -cd $mag | awk 'NR==2 {print $3}'
done

Interpretation: MAGs with higher dN/dS may be under stronger selective pressure.

Use Case 3: Pathway-Level Analysis¶

Goal: Determine if functional pathways show coordinated selection.

Workflow:

# Annotate genes (e.g., with KEGG)
# Join gene summary with annotations

# Summarize by pathway
awk '{pathway=$5; dnds=$3; sum[pathway]+=dnds; count[pathway]++}
     END {for (p in sum) print p, sum[p]/count[p]}' annotated_genes.tsv

Interpretation: Pathways with elevated mean dN/dS suggest functional adaptation.

Limitations and Considerations¶

Low Coverage Regions¶

Insufficient depth → unreliable allele frequency estimates
Apply QC filters (breadth_threshold in config)

Multiple Substitutions (k>1)¶

Path averaging assumes all pathways equally likely
High k-values may reduce accuracy
Consider time between sampling points

Stop Codons¶

Intermediate stop codons in pathways are excluded (biologically unrealistic)
Final stop codons allowed (represent valid termination)

Sample Size¶

Small sample sizes → less statistical power
Ensure sufficient replicates for significance testing upstream

Recombination¶

dN/dS assumes vertical inheritance
Horizontal gene transfer can inflate dN/dS estimates
Consider phylogenetic context

Time Between Samples¶

Short intervals: fewer substitutions, many NaN values
Long intervals: multiple substitutions (high k), path uncertainty
Optimal: intermediate divergence (1-5% nucleotide difference)

Troubleshooting¶

Error: “Required columns not found”

Check significant sites file has columns: mag_id, contig, position, gene_id, q_value or min_p_value

Warning: “Reference base mismatch”

Profile data reference doesn’t match FASTA. Verify consistent reference used throughout workflow.

Many NaN dN/dS values

No synonymous substitutions observed (pS=0) – common in short genes or low divergence
Solution: Use longer time intervals or filter for genes with k≥2

Script runs slowly

Large number of significant sites → long runtime
Solution: Use stricter p-value threshold or parallelize across MAGs

Validation¶

Sanity checks:

# Check potential sites make sense (should sum to ~3× gene length)
gzip -cd *_potential_sites_ng86.tsv.gz | awk '{s+=$2; n+=$3} END {print "S:", s, "N:", n, "Ratio:", n/s}'

# Expected N/S ratio ~2.5-3 for typical bacterial genes

# Check codon events have fractional counts summing correctly
gzip -cd *_codon_events_ng86.tsv.gz | awk '$9 != "" {print $9, $10}' | head

dN/dS Analysis Guide¶

Overview¶

Methodology¶

Nei-Gojobori Method with Path Averaging¶

Key Features¶

Workflow Integration¶

Usage¶

Basic Command¶

Key Parameters¶

Advanced Examples¶

Output Files¶

Codon Events File¶

Gene Summary File¶

MAG Summary File¶

Global Summary File¶

Interpreting Results¶

Understanding dN/dS Values¶

Understanding k-Values¶

Biological Interpretation¶

Common Use Cases¶

Use Case 1: Genes Under Selection¶

Use Case 2: Comparing Selection Across MAGs¶

Use Case 3: Pathway-Level Analysis¶

Limitations and Considerations¶

Low Coverage Regions¶

Multiple Substitutions (k>1)¶

Stop Codons¶

Sample Size¶

Recombination¶

Time Between Samples¶

Troubleshooting¶

Validation¶

Further Reading¶