Bench¶
The bench validates sploink's core claim — per-step hardware routing produces meaningful cost reduction on real multi-step workloads — and characterizes the cost / quality trade-off honestly.
Setup¶
- Task: HotpotQA distractor — multi-hop QA over ~10 candidate paragraphs per question.
- Workflow: 4-step RAG agent (
rerank → extract → reason → verify). (Aclassifystep was removed 2026-05-22 after it was identified as dead code — its output wasn't consumed downstream.) - Model: Llama 3.1 8B Instruct on both substrates — so the only variable is hardware.
- Substrates:
- CPU — Ollama
llama3.1:8b(4-bit quantized) on a MacBook - LPU — Groq
llama-3.1-8b-instant - Metrics: F1 + EM against gold answers; wall-clock latency; cost from observed token counts × published rates.
Strategies compared¶
| Strategy | rerank | extract | reason | verify | What it tests |
|---|---|---|---|---|---|
cpu_only |
cpu | cpu | cpu | cpu | "What if I never touch the cloud?" |
lpu_only |
lpu | lpu | lpu | lpu | "What if everything ran on LPU?" |
hw_routed |
cpu | cpu | lpu | cpu | The sploink thesis — cheap steps local, reasoning to LPU |
Results (n=30, all completed on every strategy)¶
| Strategy | n | Avg cost / query | Avg F1 | Avg EM | Avg latency |
|---|---|---|---|---|---|
cpu_only |
30 / 30 | $0 | 0.594 | 0.500 | 13.2s |
lpu_only |
30 / 30 | $0.000115 | 0.721 | 0.600 | 20.5s* |
hw_routed |
30 / 30 | $0.000009 | 0.589 | 0.533 | 10.5s |
* lpu_only latency is inflated by Groq free-tier rate-limit retries — on a paid Groq tier this would be ~2-3s, not 20s.
Headline numbers (hw_routed vs lpu_only)¶
| Result | |
|---|---|
| Cost reduction | 92.5% ($0.000115 → $0.000009 per query) |
| F1 delta | -0.132 (an 18% relative drop) |
| Latency reduction (on free-tier Groq) | 49% — but this is partly a rate-limit artifact, not a true win |
What this validates — honestly¶
Validated: routing cheap steps off the LPU produces dramatic cost reduction. ~10× cheaper per query, free-tier rate-limiting also disappears as a bottleneck.
Not yet validated: "preserved quality." F1 drops ~13 points with the current policy. That's not free.
Why F1 drops: even though both substrates run "Llama 3.1 8B," they're not the same in practice. Ollama's 4-bit quantized model on CPU produces subtly worse rerank scores and extract outputs; that error propagates through the workflow. The "same model, different hardware" framing is an idealization — quantization and decoder implementation matter.
The recoverable gap¶
The F1 drop is not an architectural ceiling — it's a policy artifact. Things we'll test next:
| Lever | What it tries | Expected effect |
|---|---|---|
Route rerank to LPU too |
Keep extract/verify on CPU but give rerank LPU precision | Should recover most of the F1 gap; modest cost increase |
FP16 Ollama (llama3.1:8b-instruct-fp16) |
Remove Q4 quantization from the CPU substrate | Closes ~half the F1 gap, same cost |
qwen2.5:14b on CPU |
Stronger CPU model that's better at structured outputs | Trades latency for F1, same cost |
| Constrained JSON decoding | Force rerank's JSON output to be parseable | Eliminates parse failures eating F1 |
The next iteration will be a four-cell policy sweep that produces a cost / quality curve, not a single point. The pitch becomes: "here's the cost-quality tradeoff; customers pick where to live on the curve."
Reproducing¶
# 1. Install. The bench package is shipped in the wheel as of v0.1.3
# (before that you had to clone the repo).
pip install "sploink[bench]"
# 2. Install Ollama from https://ollama.com/download, then pull the local model:
ollama pull llama3.1:8b
# 3. Set GROQ_API_KEY for the cloud calls (free tier at https://console.groq.com/keys).
export GROQ_API_KEY="gsk_..."
# 4. Run each strategy at n=30
python -m bench.run --n 30 --graphs parallel_dag --strategy cpu_only --out bench/results/v2_cpu.csv
python -m bench.run --n 30 --graphs parallel_dag --strategy lpu_only --out bench/results/v2_lpu.csv
python -m bench.run --n 30 --graphs parallel_dag --strategy hw_routed --out bench/results/v2_hw.csv
# 5. Intersection-F1 comparison (apples to apples across runs)
python -m bench.compare bench/results/v2_*.csv
# 6. Local dashboard with the savings hero + bar charts
python -m sploink.dashboard
Approximate run cost: ~$0.005 in Groq API spend at n=30 (the lpu_only run). The other two strategies are free.
If python -m bench.run says "No module named 'bench'": you have a pre-v0.1.3 install. Upgrade with pip install --upgrade "sploink[bench]".
Caveats¶
hw_routed's latency advantage is partly an artifact of Groq's free-tier rate-limiting onlpu_only. A paid Groq tier would invert this —lpu_onlywould be ~5× faster thanhw_routed. Latency comparisons here should be read for cost-context, not as absolute "sploink is faster."- HotpotQA distractor is a stand-in for enterprise multi-hop QA workloads (Glean / Hebbia / Harvey-style). Real workloads will have different prompts, different paragraph distributions, and possibly different optimal routing policies.
- All numbers are from a single n=30 run with seed-stable example ordering. Tight error bars need n≥100 and ideally cross-seed averaging.
- The
decomposedgraph variant exists in the codebase but isn't part of this experiment — that's the topology axis, not the substrate axis.
What we still want to know¶
- Does FP16 Ollama recover the F1 gap?
- Which step routing decision matters most? (We suspect
rerank.) - Does the cost/quality curve have a "sweet spot" where 90%+ savings come with <5 F1 drop?
- Does this generalize beyond HotpotQA? (TriviaQA, MS MARCO, custom enterprise workloads as design partners come on.)