case study · verba

A multilingual speech-to-draft workspace I'd actually trust to ship.

Raw transcripts preserved, ambiguity surfaced, drafts compared across providers, eval loop logged, all on Cloudflare Workers + D1 + Workers AI.

role: solo · design + build
stack: HTML/CSS/JS · CF Workers · D1 · Workers AI · OpenAI / Anthropic
timeline: weekends · 2025–2026
status: live · workspace open

Working across six natural languages, you notice the same failure mode in every speech-to-text product on the market: confidence is hidden, ambiguity is silently resolved, and the draft you receive looks confident even when the model isn't. Verba is the workspace built around the opposite assumption — preserve the raw, surface the doubt, and make the eval loop the artifact.

The problem

Most speech-to-draft tools optimise for a single output: a polished paragraph. That's the wrong unit. The unit that matters is the trace from raw audio to a written artifact you'd actually publish. Drop the trace and you can't tell whether the model heard you correctly, whether it translated faithfully, whether the draft is a clean summary or a confident hallucination. For an operator working across English, 中文, español, français, and português, that gap is unworkable.

The shape

The flow is sequential but every artifact compounds. Each pane hands its output to the next without a copy-paste, and the eval pane sees all of it.

The eval log is the artifact. Everything before it is scaffolding to make the log honest. — working notes
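
One way to picture the hand-off between panes is an append-only trace: each stage adds its artifact and nothing upstream is overwritten, so the eval stage can read the whole chain. This is a minimal sketch, not Verba's actual data model. The stage names, `createTrace`, `appendArtifact`, and `stagesSeenByEval` are all hypothetical.

```javascript
// Hypothetical stage names, loosely following the panes described above.
const STAGES = ["raw_transcript", "ambiguity", "draft", "eval"];

// Start a trace from the untouched transcript; it is stored once and never rewritten.
function createTrace(rawTranscript, lang) {
  const first = Object.freeze({ stage: "raw_transcript", payload: rawTranscript });
  return Object.freeze({ lang, artifacts: Object.freeze([first]) });
}

// Each pane appends its output and gets back a new trace; earlier artifacts are kept as-is.
function appendArtifact(trace, stage, payload) {
  if (!STAGES.includes(stage)) {
    throw new Error(`unknown stage: ${stage}`);
  }
  const next = Object.freeze({ stage, payload });
  return Object.freeze({
    ...trace,
    artifacts: Object.freeze([...trace.artifacts, next]),
  });
}

// The eval pane reads the whole chain, not just the final draft.
function stagesSeenByEval(trace) {
  return trace.artifacts.map((artifact) => artifact.stage);
}
```

Because every step returns a new frozen object, a later pane cannot silently "fix" the raw transcript, which is the property the eval log depends on.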

Constraints I picked

No build step beyond Vite. No framework. No backend other than Cloudflare. The same edge that serves joaquinh.com runs the inference. This was a deliberate choice — a cheap, well-understood edge stack is the one I'd actually push to production. If I can't build a workspace I trust on it, the "boring stack wins" thesis is just a posture.

Multilingual is not a feature; it's the test

If your speech-to-draft tool only works in English, you're not building for the world your users live in. Verba runs the eval suite in en/zh/es/fr/pt with code-switched audio, and the rubric weights faithfulness above concision deliberately — a polished mistranslation is worse than a clumsy accurate one.
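
To make the weighting concrete, here is a small sketch of a rubric score in which faithfulness outweighs concision. The 0.7 / 0.3 split and the example scores are assumptions for illustration. The source only states the ordering, not the numbers.

```javascript
// Assumed weights: the write-up only says faithfulness outranks concision.
const RUBRIC_WEIGHTS = { faithfulness: 0.7, concision: 0.3 };

// Combine per-criterion scores (each in [0, 1]) into one weighted score.
function rubricScore(scores, weights = RUBRIC_WEIGHTS) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    const value = scores[criterion];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`score for ${criterion} must be a number in [0, 1]`);
    }
    total += weight * value;
    weightSum += weight;
  }
  // Normalise so the result stays in [0, 1] even if the weights do not sum to 1.
  return total / weightSum;
}

// Illustrative scores only: a clumsy but accurate draft vs. a polished mistranslation.
const clumsyButAccurate = rubricScore({ faithfulness: 0.95, concision: 0.4 });
const polishedMistranslation = rubricScore({ faithfulness: 0.4, concision: 0.95 });
```

With these assumed weights the clumsy accurate draft scores higher than the polished mistranslation, which is the behaviour the rubric is meant to enforce.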

Sample eval run · 5 multilingual cases

2026-03-19 · provider: local llama3.2 · cost $0.00
5 / 5 cases run · zero failures
1,961 ms avg latency · single-shot
86 % Mandarin pass rate (4 cases)
100 % French pass rate (1 case)
case                                    lang pair         latency   checks
Traditional Mandarin stays Traditional  zh → zh · draft   3,423 ms  ✓ traditional-script-preserved · ✗ same-language-rewrite
Mandarin to Spanish                     zh → es · draft   1,624 ms  ✓ spanish-output · ✓ no-chinese-leftover
Colloquial Mandarin with profanity      zh → en · draft   2,204 ms  ✓ policy-behavior-visible · ○ meaning-preserved (manual) · ○ no-random-hallucination (manual)
Code-switched team update               zh → en · draft   1,325 ms  ✓ code-switch-handled · ✓ english-output
French prompt generation                fr → en · prompt  1,230 ms  ✓ english-prompt · ✓ actionable

Sample run from the offline eval pipeline (npm run eval:verba). The same harness runs against the OpenAI, Anthropic, and Workers AI providers in production; per-provider comparison data is exported as JSONL on demand. The one failed check (same-language-rewrite) is exactly the kind of falsifiable result the rubric is designed to surface: the model returned a faithful Mandarin output but did not perform the rewrite step. Full report: /verba-benchmark-report.md.
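
The headline numbers can be re-derived from the per-case rows. The sketch below transcribes the five cases and uses one counting rule that reproduces them: pass rate is passes over automated checks, grouped by source language, with manual (○) checks left out. That rule is an assumption, not something the report confirms, and the record shape, `summarize`, and `toJsonl` are hypothetical names rather than the harness's real API.

```javascript
// Results transcribed from the sample run above; the record shape is an assumption.
const sampleRun = [
  { case: "Traditional Mandarin stays Traditional", source: "zh", target: "zh", latencyMs: 3423,
    checks: { "traditional-script-preserved": "pass", "same-language-rewrite": "fail" } },
  { case: "Mandarin to Spanish", source: "zh", target: "es", latencyMs: 1624,
    checks: { "spanish-output": "pass", "no-chinese-leftover": "pass" } },
  { case: "Colloquial Mandarin with profanity", source: "zh", target: "en", latencyMs: 2204,
    checks: { "policy-behavior-visible": "pass", "meaning-preserved": "manual",
              "no-random-hallucination": "manual" } },
  { case: "Code-switched team update", source: "zh", target: "en", latencyMs: 1325,
    checks: { "code-switch-handled": "pass", "english-output": "pass" } },
  { case: "French prompt generation", source: "fr", target: "en", latencyMs: 1230,
    checks: { "english-prompt": "pass", "actionable": "pass" } },
];

// Average latency plus per-source-language pass rate over automated checks only.
function summarize(results) {
  const totalMs = results.reduce((sum, r) => sum + r.latencyMs, 0);
  const bySource = {};
  for (const r of results) {
    if (!bySource[r.source]) bySource[r.source] = { pass: 0, automated: 0 };
    for (const status of Object.values(r.checks)) {
      if (status === "manual") continue; // manual checks are not scored automatically
      bySource[r.source].automated += 1;
      if (status === "pass") bySource[r.source].pass += 1;
    }
  }
  const passRatePct = {};
  for (const [lang, b] of Object.entries(bySource)) {
    passRatePct[lang] = Math.round((100 * b.pass) / b.automated);
  }
  return { avgLatencyMs: Math.round(totalMs / results.length), passRatePct };
}

// JSONL export: one JSON object per line, newline-terminated.
function toJsonl(results) {
  return results.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

// summarize(sampleRun) → { avgLatencyMs: 1961, passRatePct: { zh: 86, fr: 100 } }
```

Under this reading, the 86 % Mandarin figure is 6 passes out of 7 automated checks, and the single miss is the same-language-rewrite check.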

Diff rendering

Word-level Myers diff with a 200ms debounce. Anything finer felt jittery; anything coarser hid the model's actual edits.
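
As a rough illustration of the word-level diff and the debounce around it, here is a self-contained sketch. It swaps in a plain longest-common-subsequence table rather than Myers' O(ND) algorithm: simpler to read, same kind of shortest insert/delete script, but slower on long inputs. Whitespace tokenization is also an assumption. Unspaced scripts such as Chinese would need a real word segmenter (for example `Intl.Segmenter`). `tokenize`, `wordDiff`, and `debounce` are illustrative names.

```javascript
// Split on whitespace. Assumption: fine for spaced scripts, not for unspaced zh text.
function tokenize(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Word-level diff via an LCS table (a simpler stand-in for Myers' algorithm).
function wordDiff(before, after) {
  const a = tokenize(before);
  const b = tokenize(after);
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table from the start, emitting keep / del / add operations.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: "keep", word: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: "del", word: a[i++] });
    } else {
      ops.push({ op: "add", word: b[j++] });
    }
  }
  while (i < a.length) ops.push({ op: "del", word: a[i++] });
  while (j < b.length) ops.push({ op: "add", word: b[j++] });
  return ops;
}

// Trailing-edge debounce: only the last call within waitMs actually runs.
function debounce(fn, waitMs = 200) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

In a UI, the render call would be wrapped once, e.g. `const renderDiff = debounce((a, b) => paint(wordDiff(a, b)), 200)`, where `paint` is whatever hypothetical function updates the pane.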

Streaming

Server-sent events all the way. WebSockets were tempting but added a connection-state surface I didn't want for a single-direction stream.
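
For reference, a one-way SSE stream needs very little machinery. The sketch below formats frames per the `text/event-stream` wire format and wraps an async iterable of draft chunks in a streaming `Response`, using only standard Web APIs (`ReadableStream`, `TextEncoder`, `Response`). The event names `delta` and `done`, the `{ text }` payload shape, and the helper names are assumptions, not Verba's actual protocol.

```javascript
// Format one SSE frame: optional event name, then a single-line JSON data field.
function formatSseEvent(data, event) {
  const lines = [];
  if (event) lines.push(`event: ${event}`);
  lines.push(`data: ${JSON.stringify(data)}`); // JSON.stringify output has no raw newlines
  return lines.join("\n") + "\n\n"; // a blank line terminates the frame
}

// Wrap an async iterable of text chunks in a streaming text/event-stream Response.
function sseResponse(chunks) {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();
  const body = new ReadableStream({
    // pull() runs only when the consumer wants more, so backpressure is respected.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.enqueue(encoder.encode(formatSseEvent({}, "done")));
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(formatSseEvent({ text: value }, "delta")));
    },
    // If the client disconnects, stop the upstream generator as well.
    async cancel() {
      if (typeof iterator.return === "function") await iterator.return();
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
  });
}
```

On the browser side, an `EventSource` with a listener for the assumed `delta` event would append each chunk to the draft pane. There is no client-to-server channel to manage, which is the trade-off described above.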

1,961 ms · avg eval latency · local provider
5 · languages eval-tested · en zh es fr pt
JSONL · exportable per-run + per-provider

What's next

Multi-turn evals. The current rubric is single-shot. The honest version of "is this model better" is multi-turn, and the workspace should make that the path of least resistance.

The workspace is the eval. If shipping the eval is hard, you'll ship without one. — readme, current