Verba — case study

Working across six natural languages, you notice the same failure mode in every speech-to-text product on the market: confidence is hidden, ambiguity is silently resolved, and the draft you receive looks confident even when the model isn't. Verba is the workspace built around the opposite assumption — preserve the raw, surface the doubt, and make the eval loop the artifact.

The problem

Most speech-to-draft tools optimise for a single output: a polished paragraph. That's the wrong unit. The unit that matters is the trace from raw audio to a written artifact you'd actually publish. Drop the trace and you can't tell whether the model heard you correctly, whether it translated faithfully, whether the draft is a clean summary or a confident hallucination. For an operator working across English, 中文, español, français, and português, that gap is unworkable.

The shape

The flow is sequential but every artifact compounds. Each pane hands its output to the next without a copy-paste, and the eval pane sees all of it.

Transcribe — drop audio, get text with confidence-scored segments. Whisper-style on Workers AI; chunked uploads to R2.
Polish — three deliberate modes: rewrite, translate, prompt-generate. Diff highlighted inline against the raw transcript.
Compare — run the polished prompt across multiple providers in parallel via /api/compare. Side-by-side diff. Cost and latency captured.
Log eval — fixed-rubric scoring (faithfulness, concision, structure, terminology). Persisted to D1. JSONL out for downstream analysis.
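
The Compare step above can be sketched as a concurrent fan-out with per-provider timing. This is a minimal sketch, not the actual /api/compare handler: the ProviderResult shape, the ProviderCall signature, and the function names are all hypothetical. Cost capture is left out because pricing is provider-specific.

```typescript
// Hypothetical shapes: the real /api/compare contract is not shown in this write-up.
interface ProviderResult {
  provider: string;
  ok: boolean;
  output: string | null;
  error: string | null;
  latencyMs: number;
}

type ProviderCall = (prompt: string) => Promise<string>;

interface NamedProvider {
  name: string;
  call: ProviderCall;
}

// Time one provider call. Errors are captured in the result instead of thrown,
// so a single failing provider cannot sink the whole comparison.
async function timedCall(p: NamedProvider, prompt: string): Promise<ProviderResult> {
  const start = Date.now();
  try {
    const output = await p.call(prompt);
    return { provider: p.name, ok: true, output, error: null, latencyMs: Date.now() - start };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { provider: p.name, ok: false, output: null, error: message, latencyMs: Date.now() - start };
  }
}

// Send the same polished prompt to every provider at once.
// Results come back in the same order as the providers array.
async function compareProviders(
  prompt: string,
  providers: NamedProvider[],
): Promise<ProviderResult[]> {
  return Promise.all(providers.map((p) => timedCall(p, prompt)));
}
```

Because timedCall never rejects, Promise.all is enough here; every provider gets a row in the side-by-side view, including the ones that errored.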

The eval log is the artifact. Everything before it is scaffolding to make the log honest. — working notes

Constraints I picked

No build step beyond Vite. No framework. No backend other than Cloudflare. The same edge that serves joaquinh.com runs the inference. This was a deliberate choice — a cheap, well-understood edge stack is the one I'd actually push to production. If I can't build a workspace I trust on it, the "boring stack wins" thesis is just a posture.

Multilingual is not a feature; it's the test

If your speech-to-draft tool only works in English, you're not building for the world your users live in. Verba runs the eval suite in en/zh/es/fr/pt with code-switched audio, and the rubric weights faithfulness above concision deliberately — a polished mistranslation is worse than a clumsy accurate one.
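
To make the weighting concrete, here is a minimal sketch of a weighted rubric score and a JSONL export. The four dimensions are the rubric's own; the 0 to 5 scale, the weight values, and the EvalRecord shape are assumptions chosen only to illustrate faithfulness outweighing concision.

```typescript
// Rubric dimensions come from the write-up; scale and weights are illustrative assumptions.
type Dimension = "faithfulness" | "concision" | "structure" | "terminology";
type Scores = Record<Dimension, number>;

const DIMENSIONS: Dimension[] = ["faithfulness", "concision", "structure", "terminology"];

// Assumed weights: faithfulness deliberately carries the most, concision the least.
const WEIGHTS: Scores = {
  faithfulness: 0.4,
  concision: 0.15,
  structure: 0.2,
  terminology: 0.25,
};

// Weighted mean of the per-dimension scores, normalised by the sum of weights.
function weightedScore(scores: Scores, weights: Scores = WEIGHTS): number {
  let total = 0;
  let weightSum = 0;
  for (const d of DIMENSIONS) {
    total += scores[d] * weights[d];
    weightSum += weights[d];
  }
  return total / weightSum;
}

// Hypothetical record shape for one scored case.
interface EvalRecord {
  caseId: string;
  langPair: string;
  provider: string;
  scores: Scores;
}

// One JSON object per line, with the weighted score attached for downstream analysis.
function toJsonl(records: EvalRecord[]): string {
  return records
    .map((r) => JSON.stringify({ ...r, weighted: weightedScore(r.scores) }))
    .join("\n");
}
```

Under these assumed weights, a draft scored 1 on faithfulness and 5 on concision and structure ranks below one scored 5 on faithfulness and 2 on concision and structure, which is the ordering the rubric is meant to enforce.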

Sample eval run · 5 multilingual cases

2026-03-19 · provider: local llama3.2 · cost $0.00

5 / 5 cases run · zero failures

1,961 ms avg latency · single-shot

86% Mandarin pass rate (4 cases)

100% French pass rate (1 case)

case	lang pair	latency	checks
Traditional Mandarin stays Traditional	zh → zh · draft	3,423 ms	✓ traditional-script-preserved ✗ same-language-rewrite
Mandarin to Spanish	zh → es · draft	1,624 ms	✓ spanish-output ✓ no-chinese-leftover
Colloquial Mandarin with profanity	zh → en · draft	2,204 ms	✓ policy-behavior-visible ○ meaning-preserved (manual) ○ no-random-hallucination (manual)
Code-switched team update	zh → en · draft	1,325 ms	✓ code-switch-handled ✓ english-output
French prompt generation	fr → en · prompt	1,230 ms	✓ english-prompt ✓ actionable

Sample run from the offline eval pipeline (npm run eval:verba). The same harness runs against the OpenAI, Anthropic, and Workers AI providers in production; per-provider comparison data is exported as JSONL on demand. The one failing check (same-language-rewrite) is exactly the kind of falsifiable result the rubric is designed to surface: the model returned a faithful Mandarin output but did not perform the rewrite step. Full report: /verba-benchmark-report.md.
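
The summary figures can be re-derived from the table. The sketch below assumes one particular reading, which the report itself may define differently: the pass rate counts automated checks only (✓ and ✗), and manual checks (○) are excluded from the denominator. The CaseResult shape and function names are hypothetical; the data is copied from the table above.

```typescript
type CheckStatus = "pass" | "fail" | "manual";

// Hypothetical shape for one table row.
interface CaseResult {
  name: string;
  lang: string; // source language of the case
  latencyMs: number;
  checks: CheckStatus[];
}

// Rows transcribed from the sample run table.
const sampleRun: CaseResult[] = [
  { name: "Traditional Mandarin stays Traditional", lang: "zh", latencyMs: 3423, checks: ["pass", "fail"] },
  { name: "Mandarin to Spanish", lang: "zh", latencyMs: 1624, checks: ["pass", "pass"] },
  { name: "Colloquial Mandarin with profanity", lang: "zh", latencyMs: 2204, checks: ["pass", "manual", "manual"] },
  { name: "Code-switched team update", lang: "zh", latencyMs: 1325, checks: ["pass", "pass"] },
  { name: "French prompt generation", lang: "fr", latencyMs: 1230, checks: ["pass", "pass"] },
];

// Percentage of automated checks that passed; manual checks are skipped (assumption).
function passRatePercent(cases: CaseResult[]): number {
  let passed = 0;
  let automated = 0;
  for (const c of cases) {
    for (const status of c.checks) {
      if (status === "manual") continue;
      automated += 1;
      if (status === "pass") passed += 1;
    }
  }
  return automated === 0 ? 0 : Math.round((passed / automated) * 100);
}

// Mean latency across cases, rounded to the nearest millisecond.
function avgLatencyMs(cases: CaseResult[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => sum + c.latencyMs, 0);
  return Math.round(total / cases.length);
}

const mandarin = sampleRun.filter((c) => c.lang === "zh");
const french = sampleRun.filter((c) => c.lang === "fr");
```

With this reading, the four Mandarin cases contain seven automated checks, six of which pass, so passRatePercent(mandarin) gives 86; the French case gives 100; and avgLatencyMs(sampleRun) gives 1961, matching the published summary.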

Diff rendering

Word-level Myers diff with a 200ms debounce. Anything finer felt jittery; anything coarser hid the model's actual edits.
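
A minimal sketch of word-level diffing with a debounced re-render follows. To stay short it uses a plain longest-common-subsequence table rather than Myers' O(ND) algorithm; both produce a minimal insert/delete script, but the LCS version costs O(n·m) time and memory. The Op type and function names are hypothetical, and whitespace tokenisation is an assumption that does not hold for unspaced scripts such as Chinese, which would need a segmenter (for example Intl.Segmenter) first.

```typescript
type Op = { kind: "equal" | "insert" | "delete"; word: string };

// Split on whitespace; adequate for spaced languages only (see assumption above).
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// Word-level diff via an LCS table (a stand-in for Myers, not Myers itself).
function diffWords(before: string, after: string): Op[] {
  const a = tokenize(before);
  const b = tokenize(after);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    lcs.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start, emitting deletions before insertions on ties.
  const ops: Op[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ kind: "equal", word: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ kind: "delete", word: a[i] });
      i++;
    } else {
      ops.push({ kind: "insert", word: b[j] });
      j++;
    }
  }
  while (i < a.length) ops.push({ kind: "delete", word: a[i++] });
  while (j < b.length) ops.push({ kind: "insert", word: b[j++] });
  return ops;
}

// Trailing-edge debounce: fn runs once, waitMs after the last call.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

In use, the render callback would be wrapped once, for example debounce(render, 200), and called on every streamed update so the diff repaints only after the text has been quiet for 200 ms.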

Streaming

Server-sent events all the way. WebSockets were tempting but added a connection-state surface I didn't want for a single-direction stream.
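
For reference, an SSE stream is plain text: each event is a group of "field: value" lines terminated by a blank line, served with Content-Type: text/event-stream. The sketch below shows only the framing. The SseEvent type, the segmentEvent helper, and the segment payload fields are hypothetical, not Verba's actual wire format.

```typescript
interface SseEvent {
  event?: string; // optional event name; clients default to "message"
  id?: string;    // optional id, echoed back as Last-Event-ID on reconnect
  data: string;
}

// Serialise one event. Multi-line data needs one "data:" line per line,
// and the frame ends with a blank line.
function formatSseEvent(e: SseEvent): string {
  const lines: string[] = [];
  if (e.event !== undefined) lines.push(`event: ${e.event}`);
  if (e.id !== undefined) lines.push(`id: ${e.id}`);
  for (const line of e.data.split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  return lines.join("\n") + "\n\n";
}

// Hypothetical payload: one confidence-scored transcript segment per event.
function segmentEvent(index: number, text: string, confidence: number): string {
  return formatSseEvent({
    event: "segment",
    id: String(index),
    data: JSON.stringify({ index, text, confidence }),
  });
}
```

On the server these frames would be written to the response body as they become available; on the client, an EventSource listener for the "segment" event would receive each payload in order, with reconnection handled by the browser.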

1,961 ms · avg eval latency · local provider

languages eval-tested · en zh es fr pt

JSONL · exportable per-run + per-provider

What's next

Multi-turn evals. The current rubric is single-shot. The honest version of "is this model better" is multi-turn, and the workspace should make that the path of least resistance.

The workspace is the eval. If shipping the eval is hard, you'll ship without one. — readme, current