16S Pipeline Validation — wetSpring

The complete 16S metagenomics pipeline (FASTQ parsing, quality filtering, dereplication, DADA2 denoising, chimera detection, taxonomy classification, diversity calculation, and UniFrac) is implemented in sovereign Rust with a single runtime dependency (flate2, for gzip).

This notebook loads frozen validation results from wetSpring experiments and visualizes the evidence that the Rust pipeline matches the established Galaxy/QIIME2/Python pipeline at machine-epsilon precision.

Data sources: experiments/results/ (frozen JSON artifacts)

Reproduce: Run any validation binary with cargo run --release --bin <name> in the wetSpring repository. See primals.eco/lab/reproduce.


For other springs: adapt this pattern by loading your own experiments/results/ JSONs. The cell structure (load → parse → visualize → provenance) is the template.
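
As a rough illustration of the load and parse steps of that template, the sketch below reads every JSON artifact in a results directory and counts passing checks. The `checks` list and its `passed` field are hypothetical names chosen for illustration; the actual artifact schema in experiments/results/ is not shown in this notebook and may differ.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Load every *.json artifact in results_dir, keyed by file stem."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.stem] = json.load(fh)
    return results


def summarize_checks(artifact):
    """Return (passed, total) for one artifact.

    Assumes a hypothetical layout: {"checks": [{"passed": bool, ...}, ...]}.
    """
    checks = artifact.get("checks", [])
    passed = sum(1 for check in checks if check.get("passed"))
    return passed, len(checks)
```

Calling `load_results("experiments/results")` and passing each artifact to `summarize_checks` would give the kind of "8/8 checks" line printed below, provided the artifacts follow the assumed layout.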

Galaxy bootstrap: 8/8 checks
Track 2 (LC-MS):  8/8 checks
16S controls:     1 BioProject(s) loaded
R/vegan parity:   19 metrics

The Pipeline

Raw FASTQ (NCBI SRA)          ← real data, not simulated
    │
    ├─ Parse (sovereign FASTQ parser)
    ├─ Quality filter (Q≥20, length≥200)
    ├─ Merge paired-end reads
    ├─ Dereplicate (unique sequences)
    ├─ DADA2 denoise (error model → ASVs)
    ├─ Chimera detection (de novo + reference)
    ├─ Taxonomy (Naïve Bayes, SILVA 138.2)
    ├─ Diversity (Shannon, Simpson, Chao1, UniFrac)
    └─ PCoA ordination
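
To make the quality-filter and dereplication steps in the diagram concrete, here is a minimal Python sketch. It is not the wetSpring implementation: it assumes Phred+33 quality encoding and interprets "Q≥20" as a mean per-read quality threshold, which the diagram does not specify (a per-base or sliding-window rule is equally plausible).

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def mean_quality(qual):
    """Mean Phred score of a FASTQ quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - PHRED_OFFSET for ch in qual) / len(qual)


def passes_filter(seq, qual, min_q=20, min_len=200):
    """Keep a read if it is long enough and its mean quality is high enough.

    The mean-quality interpretation of Q>=20 is an assumption.
    """
    return len(seq) >= min_len and mean_quality(qual) >= min_q


def dereplicate(seqs):
    """Collapse identical sequences into {sequence: abundance}."""
    return Counter(seqs)
```

The abundance counts produced by `dereplicate` are the natural input to the denoising and diversity steps further down the diagram.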

Every step has a Python/R baseline and a Rust implementation. Parity is checked to an absolute tolerance of 1e-15, a few multiples of f64 machine epsilon (about 2.22e-16).
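
A parity check at the 1e-15 tolerance could be sketched as follows. The function name is hypothetical, and the use of an absolute (rather than relative) tolerance is an assumption; the notebook does not say which the validation binaries use.

```python
import math
import sys

TOLERANCE = 1e-15  # the tolerance quoted in this notebook


def within_tolerance(rust_value, baseline_value, tol=TOLERANCE):
    """True if the two values differ by at most tol in absolute terms."""
    return math.isclose(rust_value, baseline_value, rel_tol=0.0, abs_tol=tol)


# f64 machine epsilon is smaller than the tolerance used here.
assert sys.float_info.epsilon < TOLERANCE

# Above a magnitude of 8 the gap between adjacent doubles already exceeds
# 1e-15, so an absolute check at this tolerance only passes when the two
# values are bit-identical.
assert math.ulp(20.0) > TOLERANCE
```

Under this reading, a value such as a Chao1 estimate of 20.0 passes only when the two implementations agree exactly, while values near 1 may differ by a few units in the last place.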

Galaxy Bootstrap — Exp 001

The first experiment: reproduce the Galaxy/QIIME2 "Moving Pictures" tutorial pipeline entirely in Rust.

Track 2: LC-MS Feature Extraction

Asari (mass spectrometry feature extraction) and FindPFAS (PFAS screening) validated against Python implementations.

Diversity Parity: Rust vs R/vegan

Diversity metrics are validated against R's vegan package (v2.7.3). Every metric listed below agrees with the Rust implementation to within 1e-15.

R/vegan v2.7.3 on R 4.1.2
Generated: 2026-03-10

Metric                         R/vegan Value   Status
-------------------------------------------------------
Shannon (uniform 10)       2.302585092994045   [OK]
Simpson (uniform 10)       0.900000000000000   [OK]
Shannon (skewed)           0.540841946817804   [OK]
Simpson (skewed)           0.187286385484584   [OK]
Bray-Curtis (a,b)          0.400000000000000   [OK]
Chao1 estimate            20.000000000000000   [OK]
Pielou (uniform)           1.000000000000000   [OK]
Pielou (skewed)            0.234884673084784   [OK]

Rarefaction monotonic: True
Bray-Curtis symmetric: True

All metrics match Rust implementation at machine epsilon (1e-15).
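
For readers who want to check the uniform-community rows by hand, the textbook definitions can be sketched in plain Python. This is not the Rust or vegan code. It assumes natural-log Shannon entropy and the 1 - Σp² form of the Simpson index, which is consistent with the uniform-10 values in the table (ln 10 and 0.9). The input vectors behind the "skewed" and "(a,b)" rows are not given in the notebook, so they are not reproduced here.

```python
import math


def shannon(counts):
    """Shannon entropy H = -sum(p * ln p) over non-zero abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity in the 1 - sum(p^2) form."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def pielou(counts):
    """Pielou evenness J = H / ln(S); undefined when only one taxon is present."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator
```

With ten equally abundant taxa, `shannon` returns a value within floating-point rounding of ln 10, `simpson` of 0.9, and `pielou` of 1.0, matching the uniform rows above. Swapping the arguments of `bray_curtis` leaves the result unchanged, which is the symmetry property reported in the output.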

Real NCBI Data — 16S Controls

Validation against real NCBI BioProject data (not simulated). The sovereign FASTQ parser processes actual sequencing reads, and the resulting diversity metrics are cross-validated against Python/NumPy.

BioProject: PRJNA488170 / SRR7760408 (Nannochloropsis outdoor 16S, Wageningen)
  Reads parsed:     50,000
  After QC:         49,650 (99.3%)
  Unique sequences: 1,345
  Diversity:
    Shannon:  7.030661
    Simpson:  0.998799
    Observed: 1345
    Chao1:    1345.0
  Elapsed:   1.21s
  Python:    3.12.13 / NumPy 2.4.3
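
The Chao1 figure can be read against the estimator's definition. The sketch below uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons. Which Chao1 variant the pipeline uses is not stated in this notebook, so this choice is an assumption.

```python
def chao1(abundances):
    """Bias-corrected Chao1 richness estimate (assumed variant).

    abundances: iterable of per-sequence read counts.
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # observed richness
    f1 = sum(1 for n in counts if n == 1)    # singletons
    f2 = sum(1 for n in counts if n == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# QC retention reported above: 49,650 of 50,000 reads.
retention_pct = 49_650 / 50_000 * 100
```

Under this form, the estimate equals the observed richness whenever the sample contains at most one singleton, which is one way the reported Chao1 of 1345.0 can coincide with the observed count of 1345.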

Validation Summary

Component                      Checks         Status
----------------------------------------------------
Galaxy bootstrap (Exp 001)     8/8            PASS
Track 2 LC-MS (Asari + PFAS)   8/8            PASS
R/vegan diversity parity       8 metrics      PASS
NCBI real-data 16S             1 BioProject   PASS

The sovereign Rust pipeline matches Galaxy/QIIME2, Python/SciPy, and R/vegan at machine-epsilon precision across all tested domains.


Provenance: All results are content-addressed via BLAKE3 hashes, tracked in rhizoCrypt DAG sessions, committed to the loamSpine ledger, and witnessed with ed25519 signatures via sweetGrass braid.

Reproduce: See primals.eco/lab/reproduce

Source: syntheticChemistry/wetSpring