16S Pipeline Validation: wetSpring
The complete 16S metagenomics pipeline (FASTQ parsing, quality filtering, dereplication, DADA2 denoising, chimera detection, taxonomy classification, diversity calculation, and UniFrac) is implemented in sovereign Rust with one runtime dependency (flate2 for gzip).
This notebook loads frozen validation results from wetSpring experiments and visualizes the evidence that the Rust pipeline matches the established Galaxy/QIIME2/Python pipeline at machine-epsilon precision.
Data sources: experiments/results/ (frozen JSON artifacts)
Reproduce: Run any validation binary with cargo run --release --bin <name> in the wetSpring repository. See primals.eco/lab/reproduce.
For other springs: adapt this pattern by loading your own experiments/results/ JSONs.
The cell structure (load → parse → visualize → provenance) is the template.
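The load and parse steps of that template can be sketched as follows. This is a minimal sketch, not the notebook's actual cell: the function names `load_results` and `summarize` are hypothetical, and it assumes only that experiments/results/ holds one JSON artifact per file.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Load every *.json artifact in results_dir, keyed by file stem."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.stem] = json.load(fh)
    return results


def summarize(results):
    """Return one line per artifact: its name and top-level entry count."""
    lines = []
    for name, payload in results.items():
        size = len(payload) if isinstance(payload, (dict, list)) else 1
        lines.append(f"{name}: {size} top-level entries")
    return lines
```

Calling `summarize(load_results("experiments/results"))` gives a quick inventory of what was loaded before any plotting cell runs.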
Galaxy bootstrap: 8/8 checks
Track 2 (LC-MS): 8/8 checks
16S controls: 1 BioProject(s) loaded
R/vegan parity: 19 metrics
The Pipeline
Raw FASTQ (NCBI SRA) ← real data, not simulated
│
├─ Parse (sovereign FASTQ parser)
├─ Quality filter (Q≥20, length≥200)
├─ Merge paired-end reads
├─ Dereplicate (unique sequences)
├─ DADA2 denoise (error model → ASVs)
├─ Chimera detection (de novo + reference)
├─ Taxonomy (Naïve Bayes, SILVA 138.2)
├─ Diversity (Shannon, Simpson, Chao1, UniFrac)
└─ PCoA ordination
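To make the quality-filter step concrete, here is a minimal Python sketch of a per-read filter using the thresholds shown above. It assumes that "Q≥20" means the mean Phred score of the read and that qualities are Phred+33 encoded; the Rust pipeline may apply the threshold differently (for example per base or in a sliding window), and the function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of an ASCII-encoded quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(sequence, quality, min_q=20.0, min_len=200):
    """Return True if the read meets both the length and mean-quality thresholds."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    if not sequence or len(sequence) < min_len:
        return False
    return mean_phred(quality) >= min_q
```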
Every step has a Python/R baseline. Every step has a Rust implementation. Parity is checked to a tolerance of 1e-15, which is on the order of f64 machine epsilon (about 2.2e-16).
Galaxy Bootstrap: Exp 001
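A parity check of this kind can be sketched in a few lines. The sketch assumes an absolute tolerance and a pair of name-to-value dictionaries; both are illustrative choices, and `check_parity` is a hypothetical helper rather than a function from the wetSpring repository.

```python
TOLERANCE = 1e-15  # the tolerance quoted in the text


def check_parity(rust_values, baseline_values, tol=TOLERANCE):
    """Compare each baseline metric with its Rust counterpart.

    Returns {name: (absolute_difference, passed)}.
    """
    report = {}
    for name, expected in baseline_values.items():
        diff = abs(rust_values[name] - expected)
        report[name] = (diff, diff <= tol)
    return report
```

One design point to keep in mind: for values of roughly 8 and above, adjacent f64 values are already more than 1e-15 apart, so an absolute tolerance of 1e-15 only passes on exact equality. A relative tolerance (for example via `math.isclose`) is the usual alternative when metrics span several orders of magnitude.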
The first experiment: reproduce the Galaxy/QIIME2 "Moving Pictures" tutorial pipeline entirely in Rust.
Track 2: LC-MS Feature Extraction
Asari (mass spectrometry feature extraction) and FindPFAS (PFAS screening) are validated against their Python implementations.
Diversity Parity: Rust vs R/vegan
Diversity metrics are validated against R's vegan package (v2.7.3).
Every metric matches to machine epsilon.
R/vegan v2.7.3 on R 4.1.2
Generated: 2026-03-10

| Metric | R/vegan Value | Status |
|---|---|---|
| Shannon (uniform 10) | 2.302585092994045 | [OK] |
| Simpson (uniform 10) | 0.900000000000000 | [OK] |
| Shannon (skewed) | 0.540841946817804 | [OK] |
| Simpson (skewed) | 0.187286385484584 | [OK] |
| Bray-Curtis (a,b) | 0.400000000000000 | [OK] |
| Chao1 estimate | 20.000000000000000 | [OK] |
| Pielou (uniform) | 1.000000000000000 | [OK] |
| Pielou (skewed) | 0.234884673084784 | [OK] |

Rarefaction monotonic: True
Bray-Curtis symmetric: True
All metrics match Rust implementation at machine epsilon (1e-15).
Real NCBI Data: 16S Controls
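The uniform rows above can be reproduced from the standard definitions: Shannon H = -Σ p ln p, Simpson as 1 - Σ p², and Pielou J = H / ln S. The sketch below assumes "uniform 10" means ten equally abundant taxa; the skewed and Bray-Curtis input vectors are not given in this notebook, so they are not reproduced here.

```python
import math


def proportions(counts):
    """Relative abundances of the taxa with non-zero counts."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon(counts):
    """Shannon entropy with natural logarithm: H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson(counts):
    """Simpson diversity in the 1 - sum(p^2) form."""
    return 1.0 - sum(p * p for p in proportions(counts))


def pielou(counts):
    """Pielou evenness J = H / ln(S); undefined when only one taxon is present."""
    richness = len(proportions(counts))
    return shannon(counts) / math.log(richness)


uniform = [5] * 10  # assumed input: ten taxa with equal abundance
```

For the uniform community, H reduces to ln(10), Simpson to 1 - 10 × 0.1², and J to 1, which agrees with the three uniform rows to within floating-point rounding.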
Validation against real NCBI BioProject data (not simulated). The sovereign FASTQ parser processes actual sequencing reads and diversity metrics are cross-validated against Python/NumPy.
BioProject: PRJNA488170 / SRR7760408 (Nannochloropsis outdoor 16S, Wageningen)
Reads parsed: 50,000
After QC: 49,650 (99.3%)
Unique sequences: 1,345
Diversity:
Shannon: 7.030661
Simpson: 0.998799
Observed: 1345
Chao1: 1345.0
Elapsed: 1.21s
Python: 3.12.13 / NumPy 2.4.3
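For reference, a Chao1 estimate can be computed from per-sequence abundance counts as sketched below. This uses the bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of sequences seen exactly once and exactly twice; which form the Rust pipeline uses is an assumption here.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from abundance counts."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

When at most one sequence occurs exactly once, the correction term vanishes and Chao1 equals the observed count, which is consistent with the Observed and Chao1 values reported above. Whether that reflects singleton removal upstream is not stated in this output.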
Validation Summary
| Component | Checks | Status |
|---|---|---|
| Galaxy bootstrap (Exp 001) | 8/8 | PASS |
| Track 2 LC-MS (Asari + PFAS) | 8/8 | PASS |
| R/vegan diversity parity | 8 metrics | PASS |
| NCBI real-data 16S | 1 BioProject | PASS |
The sovereign Rust pipeline matches Galaxy/QIIME2, Python/SciPy, and R/vegan at machine-epsilon precision across all tested domains.
Provenance: All results are content-addressed via BLAKE3 hashes, tracked in rhizoCrypt DAG sessions, committed to the loamSpine ledger, and witnessed with ed25519 signatures via sweetGrass braid.
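Content addressing means the identifier of a result is a hash of its bytes, so any change to the result changes its address. The sketch below illustrates the idea only: Python's standard library has no BLAKE3, so it substitutes BLAKE2b from `hashlib` as a stand-in, and the canonical-JSON serialization is an assumption rather than the format wetSpring uses.

```python
import hashlib
import json


def content_address(result):
    """Hash a JSON-serializable result into a hex content address.

    Keys are sorted and whitespace is removed so that logically equal
    results always serialize to the same bytes.
    """
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    # BLAKE2b is a stand-in here; the text specifies BLAKE3.
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=32).hexdigest()
```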
Reproduce: See primals.eco/lab/reproduce
Source: syntheticChemistry/wetSpring