Skip to content

Sploink

Heterogeneous compute routing for AI agent workflows.

Sploink intercepts each LLM call in your agent code, classifies it by step type, and routes it to the hardware architecture (CPU, LPU, GPU, NPU, frontier API) where it's most cost-effective. Same agent code, different per-step compute placement, 60-80% inference cost savings on multi-step workloads.

WORKFLOW STEPS SPLOINK HARDWARE rerank score paragraphs 400 tok extract pull relevant facts 300 tok reason synthesize answer 60 tok verify answer follows facts? 6 tok router classify each call pick hardware pick substrate CPU Ollama, local · llama3.1:8b $0 marginal 3 calls routed here → LPU Groq, cloud · llama-3.1-8b-instant $0.05/M in tokens 1 call routed here → Same model on both hardware. The reason step goes to LPU; rerank / extract / verify stay on free CPU.

Get started in 30 seconds → View on GitHub →


Why per-step hardware routing?

A typical agent workflow has 5–20 distinct LLM calls per request — classification, reranking, extraction, reasoning, verification. Each step has a wildly different compute profile, but most stacks route every step to the same frontier API and overpay on the easy ones.

The cost gap between hardware architectures for the same model is huge:

Hardware $ / M input tokens (Llama 8B) Speed vs Frontier
Frontier API (Anthropic Sonnet) $3.00 medium
GPU rental (Together) $0.18 medium 17× cheaper
LPU (Groq) $0.05 fast 60× cheaper
CPU (Ollama local) $0 slow ∞× cheaper

Sploink routes the cheap steps to cheap hardware, keeps the hard steps on capable hardware. The cost compounds across every step of every query.


In 30 seconds

pip install "sploink[anthropic]"   # or [groq], [ollama], [all]
export ANTHROPIC_API_KEY="sk-ant-..."   # your provider's key, not sploink-specific
import sploink
from anthropic import Anthropic   # or groq, openai, ollama, together

sploink.wrap()                    # patch every supported SDK
sploink.enable_routing()          # opt into per-step routing

client = Anthropic()
# ... your existing agent code, unchanged ...

sploink.trace.print_summary()     # see per-step cost / latency / hardware

That's it. The interception is transparent — no agent-code changes needed.

Full quickstart →


Two-layer architecture

Sploink routes in two decoupled layers:

Layer 1 — Routing policy. Workflow step → hardware type. "Classify is cheap → CPU. Reasoning needs low latency → LPU." This is sploink's strategic decision.

Layer 2 — Substrate selection. Hardware type → specific provider instance. "Need a CPU? Ollama if available. Need an LPU? Groq." This is operational; eventually load-balancing across providers.

Adding a new provider (Salad for CPU, Cerebras for LPU) is a one-line config change. Adding a new hardware type is a config change plus a dispatcher.

Open the interactive architecture diagram →


Bench results

We ran HotpotQA (multi-hop QA) on a 4-step workflow (rerank → extract → reason → verify), Llama 3.1 8B on both substrates, n=30. Clean run — no rate-limit failures.

Strategy Cost / query F1 Latency
cpu_only (Ollama local) $0 0.594 13.2s
lpu_only (Groq LPU) $0.000115 0.721 20.5s
hw_routed (cheap → CPU, reason → LPU) $0.000009 0.589 10.5s

Headline (hw_routed vs lpu_only): 92.5% cost reduction, F1 delta -0.132.

The cost half of the thesis is strongly supported. The F1 trade-off with the current routing policy is real but tunable — the next iteration sweeps over which steps get routed to which hardware, and characterizes the full cost-quality curve.

Full bench methodology + caveats →


What's in the box

Module What it does
sploink.wrap() Monkey-patches Anthropic, Groq, OpenAI, Together, Ollama clients
sploink.trace Per-workflow CallRecords, JSONL persistence, summary tables
sploink.router Static rule table mapping step type → hardware type
sploink.graph DAG data structure for execution graphs
sploink.architecture Single-file HTML viz of workflow ↔ hardware bipartite

Composes with what you already use

Sploink doesn't replace your agent framework — it routes whatever you already have:

  • Using LangGraph or DSPy? Sploink wraps the LLM calls those frameworks emit.
  • Using plain Python? Sploink's wrap layer works on any of the major LLM SDKs.
  • Using OpenRouter for model-level routing? Sploink sits one layer below: substrate routing.

The model layer can stay yours; sploink owns the hardware decision.


Where we are

Status
Observability (sploink.wrap()) ✅ Works
Static rule-based routing ✅ Works
Two-layer architecture (policy + selection) ✅ Works
Trace visualization ✅ Works
Bench against HotpotQA ✅ Methodology stable, numbers refreshing
Learned routing from telemetry 🔜 Roadmap
Graph inference from trace 🔜 Roadmap
Hot-swap LangGraph / DSPy graphs 🔜 Roadmap

Full roadmap →


License

Sploink is MIT licensed. The SDK, bench, examples, and docs are all open source. Future enterprise features (hosted dashboard, learned routing models, managed substrate marketplace) will live in a separate proprietary product; the open-source library will stay open.


Built by Tim Nguyen · Pre-MVP, accepting feedback