# dN/dS Analysis Guide This guide covers dN/dS ratio analysis in AlleleFlux using the Nei-Gojobori (1986) method with path averaging. ## Overview The **dN/dS ratio** (ω) measures evolutionary selection by comparing: - **dN**: Rate of non-synonymous substitutions (change amino acid) - **dS**: Rate of synonymous substitutions (silent mutations) **Interpretation:** | dN/dS | Selection Mode | |-------|----------------| | < 1 | **Purifying selection** – deleterious mutations removed | | ≈ 1 | **Neutral evolution** – no selective pressure | | > 1 | **Positive selection** – advantageous mutations favored | ## Methodology ### Nei-Gojobori Method with Path Averaging AlleleFlux implements the **NG86 method** to calculate dN/dS ratios from statistically significant sites identified by the workflow. #### Key Features 1. **Potential Sites Calculation** - For each codon, determine how many possible single-nucleotide changes are synonymous vs. non-synonymous - Results in fractional S and N values (e.g., ATG: S=1.0, N=8.0) 2. **Path Averaging for Multi-Position Changes** - When multiple positions change in a codon (k=2 or k=3), enumerate all possible mutational pathways - Average S/NS classifications across paths - Exclude pathways through intermediate stop codons - Example: AAA→TTG has 6 pathways (3!), each potentially classifying mutations differently 3. **Codon-Level Analysis** - Pre-computed cache for all 64 codons (potential sites) - Pre-computed cache for all 4,096 codon pairs (path-averaged counts) - O(1) lookups during analysis for efficiency **Formula:** ```text For each gene: Total_S = Σ(potential S sites for each codon) Total_N = Σ(potential N sites for each codon) Observed_S = Σ(path-averaged S count for changed codons) Observed_N = Σ(path-averaged NS count for changed codons) pS = Observed_S / Total_S pN = Observed_N / Total_N dN/dS = pN / pS ``` ### Workflow Integration dN/dS analysis follows statistical testing: ```text Statistical tests → p_value_summary.py → significant sites ↓ alleleflux-dnds-from-timepoints → dN/dS ratios ``` **Required inputs:** 1. Significant sites file from `alleleflux-p-value-summary` 2. Profile files for ancestral/derived timepoints 3. Prodigal gene predictions 4. Reference FASTA ## Usage ### Basic Command ```bash alleleflux-dnds-from-timepoints \ --input significant_sites_summary.tsv \ --output output_dir/ \ --mag_id Bacteroides_001 \ --profiles_dir profiles/ \ --prodigal_fasta genes.fna \ --fasta reference.fasta \ --ancestral_timepoint pre \ --derived_timepoint post ``` ### Key Parameters | Parameter | Description | |-----------|-------------| | `--input` | Significant sites TSV from `p_value_summary` | | `--mag_id` | MAG identifier to analyze | | `--p_value_column` | `q_value` (FDR-corrected) or `min_p_value` (default: `q_value`) | | `--p_value_threshold` | Significance cutoff (default: 0.05) | | `--test_type` | Filter by test (e.g., `two_sample_paired_tTest`) | | `--group_analyzed` | Filter by group | | `--ancestral_timepoint` | Earlier timepoint for sequence reconstruction | | `--derived_timepoint` | Later timepoint for comparison | ### Advanced Examples **Filter by test type and group:** ```bash alleleflux-dnds-from-timepoints \ --input p_value_summary.tsv \ --output dnds_results/ \ --mag_id Lachnospira_002 \ --test_type two_sample_paired_tTest \ --group_analyzed treatment \ --p_value_column q_value \ --p_value_threshold 0.01 \ --profiles_dir profiles/ \ --prodigal_fasta genes.fna \ --fasta ref.fasta \ --ancestral_timepoint day0 \ --derived_timepoint day30 ``` **Process multiple MAGs:** ```bash for mag in $(cat mag_list.txt); do alleleflux-dnds-from-timepoints \ --input p_value_summary.tsv \ --output dnds_results/ \ --mag_id $mag \ --profiles_dir profiles/ \ --prodigal_fasta genes.fna \ --fasta ref.fasta \ --ancestral_timepoint pre \ --derived_timepoint post done ``` ## Output Files ### Codon Events File **Path:** `{output}/{mag}_codon_events_ng86.tsv.gz` Codon-level substitution details with path-averaged counts. | Column | Type | Description | |--------|------|-------------| | `gene_id` | str | Gene identifier | | `codon_number` | int | Position in gene (1-based) | | `codon_before` | str | Ancestral codon | | `codon_after` | str | Derived codon | | `aa_before`, `aa_after` | str | Amino acids | | `k` | int | Number of changed positions (1, 2, or 3) | | `fractional_S` | float | Path-averaged synonymous count | | `fractional_NS` | float | Path-averaged non-synonymous count | | `potential_S`, `potential_N` | float | Expected S/N sites for this codon | ### Gene Summary File **Path:** `{output}/{mag}_gene_summary_ng86.tsv.gz` Gene-level dN/dS ratios. | Column | Description | |--------|-------------| | `gene_id` | Gene identifier | | `dN_dS` | dN/dS ratio | | `pS`, `pN` | Proportional synonymous/non-synonymous rates | | `observed_S`, `observed_N` | Total path-averaged counts | | `potential_S`, `potential_N` | Total expected S/N sites | | `num_codons_changed` | Count of changed codons | | `num_significant_sites` | Count of significant positions in gene | ### MAG Summary File **Path:** `{output}/{mag}_mag_summary_ng86.tsv.gz` MAG-level dN/dS summary. | Column | Description | |--------|-------------| | `mag_id` | MAG identifier | | `dN_dS` | Overall MAG dN/dS | | `num_genes` | Total genes analyzed | | `num_codons_changed` | Total changed codons across MAG | ### Global Summary File **Path:** `{output}/{mag}_global_summary_ng86.tsv` Single-row summary of entire analysis. Contains aggregate `dN_dS`, total sites analyzed, and overall statistics. ## Interpreting Results ### Understanding dN/dS Values | dN/dS Range | Interpretation | Example Genes | |-------------|----------------|---------------| | 0 - 0.3 | Strong purifying selection | Essential housekeeping genes | | 0.3 - 0.7 | Moderate constraint | Metabolic enzymes | | 0.7 - 1.3 | Neutral/weak selection | Structural proteins | | 1.3 - 3.0 | Positive selection | Antibiotic resistance genes | | > 3.0 | Strong positive selection | Host-pathogen interaction genes | **NaN values:** Occur when no synonymous substitutions observed (pS = 0). Common in short genes or low-divergence comparisons. ### Understanding k-Values The **k** value indicates how many positions changed in a codon: - **k=1**: Single-position change (e.g., ATG→ATC) – straightforward S/NS classification - **k=2**: Two positions changed (e.g., ATG→ACG) – 2! = 2 pathways averaged - **k=3**: Three positions changed (e.g., ATG→CAA) – 3! = 6 pathways averaged Higher k-values may indicate: - Longer time between samples - Stronger selection pressure - Less reliable dN/dS estimates (fewer intermediate observations) ### Biological Interpretation **High dN/dS (>1) scenarios:** 1. **Antibiotic exposure**: Resistance gene mutations selected 2. **Nutrient shifts**: Metabolic adaptation to new resources 3. **Immune evasion**: Surface protein diversification 4. **Novel environment**: Rapid adaptation to colonization **Low dN/dS (<0.5) scenarios:** 1. **Core metabolism**: Essential enzymes under constraint 2. **Structural genes**: Protein folding requirements 3. **Information processing**: Ribosomal proteins, polymerases ## Common Use Cases ### Use Case 1: Genes Under Selection **Goal:** Identify genes with dN/dS > 1 indicating positive selection. **Workflow:** ```bash # Run dN/dS analysis alleleflux-dnds-from-timepoints --input p_val_summary.tsv --output dnds/ --mag_id MAG001 [...] # Filter gene summary for high dN/dS awk '$3 > 1' dnds/MAG001_gene_summary_ng86.tsv.gz | gzip -cd | sort -k3,3nr > genes_under_selection.tsv ``` **Interpretation:** Genes with dN/dS > 1 are candidates for adaptive evolution. ### Use Case 2: Comparing Selection Across MAGs **Goal:** Compare selection patterns between microbial taxa. **Workflow:** ```bash # Process multiple MAGs for mag in MAG001 MAG002 MAG003; do alleleflux-dnds-from-timepoints --input p_val_summary.tsv --mag_id $mag --output dnds/ [...] done # Extract MAG-level dN/dS for mag in dnds/*_mag_summary_ng86.tsv.gz; do echo "$(basename $mag)" gzip -cd $mag | awk 'NR==2 {print $3}' done ``` **Interpretation:** MAGs with higher dN/dS may be under stronger selective pressure. ### Use Case 3: Pathway-Level Analysis **Goal:** Determine if functional pathways show coordinated selection. **Workflow:** ```bash # Annotate genes (e.g., with KEGG) # Join gene summary with annotations # Summarize by pathway awk '{pathway=$5; dnds=$3; sum[pathway]+=dnds; count[pathway]++} END {for (p in sum) print p, sum[p]/count[p]}' annotated_genes.tsv ``` **Interpretation:** Pathways with elevated mean dN/dS suggest functional adaptation. ## Limitations and Considerations ### Low Coverage Regions - Insufficient depth → unreliable allele frequency estimates - Apply QC filters (`breadth_threshold` in config) ### Multiple Substitutions (k>1) - Path averaging assumes all pathways equally likely - High k-values may reduce accuracy - Consider time between sampling points ### Stop Codons - Intermediate stop codons in pathways are excluded (biologically unrealistic) - Final stop codons allowed (represent valid termination) ### Sample Size - Small sample sizes → less statistical power - Ensure sufficient replicates for significance testing upstream ### Recombination - dN/dS assumes vertical inheritance - Horizontal gene transfer can inflate dN/dS estimates - Consider phylogenetic context ### Time Between Samples - Short intervals: fewer substitutions, many NaN values - Long intervals: multiple substitutions (high k), path uncertainty - Optimal: intermediate divergence (1-5% nucleotide difference) ## Troubleshooting **Error: "Required columns not found"** : Check significant sites file has columns: `mag_id`, `contig`, `position`, `gene_id`, `q_value` or `min_p_value` **Warning: "Reference base mismatch"** : Profile data reference doesn't match FASTA. Verify consistent reference used throughout workflow. **Many NaN dN/dS values** : - No synonymous substitutions observed (pS=0) – common in short genes or low divergence - Solution: Use longer time intervals or filter for genes with k≥2 **Script runs slowly** : - Large number of significant sites → long runtime - Solution: Use stricter p-value threshold or parallelize across MAGs ### Validation Sanity checks: ```bash # Check potential sites make sense (should sum to ~3× gene length) gzip -cd *_potential_sites_ng86.tsv.gz | awk '{s+=$2; n+=$3} END {print "S:", s, "N:", n, "Ratio:", n/s}' # Expected N/S ratio ~2.5-3 for typical bacterial genes # Check codon events have fractional counts summing correctly gzip -cd *_codon_events_ng86.tsv.gz | awk '$9 != "" {print $9, $10}' | head ``` ## Further Reading **Primary References:** - Nei & Gojobori (1986). *Mol Biol Evol* 3:418-426. [DOI:10.1093/oxfordjournals.molbev.a040410](https://doi.org/10.1093/oxfordjournals.molbev.a040410) - Yang & Nielsen (2000). *Mol Biol Evol* 17:32-43. (Review of dN/dS methods) **Related AlleleFlux Documentation:** - [Interpreting Results](interpreting_results.md) – Allele frequency interpretation - [Output Files Reference](../reference/outputs.md) – Output file specifications - [CLI Reference](../reference/cli_reference.md) – Command-line options **External Tools:** - [PAML](http://abacus.gene.ucl.ac.uk/software/paml.html) – Phylogeny-based dN/dS - [HyPhy](http://hyphy.org/) – Hypothesis testing for positive selection - [SNPGenie](https://github.com/chasewnelson/SNPGenie) – Within-population dN/dS