Testing

182 unit tests across 19 files (Vitest) plus 10 end-to-end tests across 2 files (Playwright), all offline: no real API keys are needed. The suite proves the architecture hangs together and guards the contracts that matter.

Run them

npm test # vitest run — the 182 unit tests, one shot
npm run test:watch # vitest watch
npm run test:e2e # playwright — the 10 e2e tests (boots the Vite dev server)
npm run test:all # vitest + playwright together

Unit tests (Vitest)

| Suite | File | Tests | Guards |
| --- | --- | --- | --- |
| Graders | src/tasks/graders.test.ts | 42 | Every grader (full/partial/wrong), tolerance, boolEq, Zod validity of the full 111-scenario pool, correct-outcome metadata |
| Worker helpers | worker/worker.test.ts | 13 | Schema resolution, model building per provider, toModelMessages (system split + image_url → image), SSE re-wrap framing + reconstruction + error isolation |
| Worker handler | worker/handler.test.ts | 10 | Request validation: bad provider/role, message schema, SSRF image_url rejection, keys never leak + placeholder-model flag in /api/config, system-only rejection |
| Scoring | src/engine/scoring.test.ts | 8 | scoreItem answer-key stamping, all aggregates |
| Arrival pump | src/engine/arrivalPump.test.ts | 9 | Weighted draw, difficulty spread, adversarial bias, seedQueue |
| Fairness | src/engine/fairness.test.ts | 8 | Lanes receive identical scenarios; clones independent; pool never exhausts; mode presets |
| parseOutput | src/agents/createAgent.test.ts | 13 | Fence/prose/coercion/malformed tolerance, tracking-number protection, single-element-array enum unwrap (Cerebras ["jam"] → "jam") while preserving genuine array fields |
| Image → data URL | src/agents/image.test.ts | 6 | base64 conversion, memoized cache, asset resolution |
| Prompt assembly | src/tasks/prompt.test.ts | 7 | System prompt, schema-to-instruction, multimodal message building |
| Registry | src/tasks/registry.test.ts | 7 | getTaskType/findTaskType/allTaskTypes/taskIds, unknown-id handling |
| Clients | src/agents/clients.test.ts | 8 | Mock determinism + errorRate, human resolver contract |
| Policy + trace | src/orchestrator/policy.test.ts | 13 | shouldRetry/shouldEscalate all branches, ROI math, trace |
| Pipeline | src/orchestrator/pipeline.test.ts | 4 | Single-agent collapse without a provider; router dispatch |
| Verdict | src/orchestrator/verdict.test.ts | 3 | The shared deriveVerdict matches every scenario's stamped correctOutcome.verdict; per-task disposition rules; clean escalation ≠ blanket hold |
| Agent roster | src/data/agentRoster.test.ts | 9 | Roster covers all roles + steps; every task has humanControls matching focusFields |
| Arena store | src/store/arena.test.ts | 7 | Phase transitions, start/reset/endEarly, run config |
| Human input | src/stage/humanInput.test.ts | 4 | Resolver contract, skip/forfeit, safety timeout |
| Engine integration | src/engine/loop.test.ts | 9 | Full mock race to completion; endless; fairness convergence; provider-error surfacing; offline circuit breaker; pacemaker arrivals (fastest lane never backlogs) |
| Human lane integration | src/stage/humanLane.test.ts | 2 | A human lane completes items with answer-key stats populated |
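
The single-element-array unwrap in the parseOutput row is easier to picture with a small example. The sketch below is not the project's parseOutput; the helper name unwrapSingletons, its signature, and the field names fault and tags are assumptions made for illustration. It unwraps a one-element array only when the field is not declared as a genuine array field.

```typescript
// Hypothetical helper: the name, signature, and field names are illustrative only.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

/**
 * Unwrap one-element arrays for fields expected to be scalar,
 * leaving every field listed in `arrayFields` untouched.
 */
function unwrapSingletons(
  output: Record<string, Json>,
  arrayFields: ReadonlySet<string>,
): Record<string, Json> {
  const result: Record<string, Json> = {};
  for (const [key, value] of Object.entries(output)) {
    if (Array.isArray(value) && value.length === 1 && !arrayFields.has(key)) {
      result[key] = value[0]; // e.g. ["jam"] becomes "jam"
    } else {
      result[key] = value; // genuine arrays and scalars pass through unchanged
    }
  }
  return result;
}

// "tags" is declared as a real array field, so only "fault" is unwrapped.
const repaired = unwrapSingletons(
  { fault: ["jam"], tags: ["fragile"] },
  new Set(["tags"]),
);
console.log(JSON.stringify(repaired)); // {"fault":"jam","tags":["fragile"]}
```

Passing the array fields in explicitly keeps the repair conservative: a model that wraps an enum value in an array is corrected, while a field that legitimately holds one item stays a list.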

End-to-end tests (Playwright)

npm run test:e2e boots the Vite dev server on port 5199 and drives a real browser through the user-visible journey (mock mode, no API keys). Two suites:

| Suite | File | Tests | Covers |
| --- | --- | --- | --- |
| The core race flow | e2e/race.spec.ts | 7 | Lobby logo + task explorer + GO; browsing a task's scenarios; the agent-roster panel; GO starts a race (lanes, scoreboard, controls); End scores + winner banner; Reset to lobby; a 15s blitz auto-ends |
| The human lane | e2e/human.spec.ts | 3 | "I Wanna Play" shows the inline human lane panel during a race; submitting clears the question; Skip forfeits the parcel |

The load-bearing tests

A few tests are worth understanding because they guard the demo's integrity:

  • fairness.test.ts — asserts nextArrival(['a','b','c']) returns clones with identical taskTypeId/groundTruth/correctOutcome. If this breaks, the race isn't fair.
  • loop.test.ts "runs a blitz race to completion" — fast-forwards a full mock race with fake timers. If any layer (pump → pipeline → grader → scoring) is broken, the race never completes or the numbers are wrong.
  • graders.test.ts "every scenario validates against its schema" — runs the loader over all 111 scenarios. A malformed scenario fails here, at test time, not mid-demo.
  • handler.test.ts "never leaks keys" — asserts /api/config returns no key material.
  • e2e/race.spec.ts "End scores the race and shows the winner banner" — boots the real app, runs a mock race, and asserts the win banner renders. If the store → engine → UI wiring is broken end-to-end, this fails.
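
The fairness contract in the first bullet comes down to two properties: every lane gets the same scenario content, and each lane's copy is independent. The sketch below illustrates both under stated assumptions. dealToLanes is a hypothetical stand-in rather than the project's nextArrival, and the Scenario shape is a guess beyond the three field names taken from the bullet above.

```typescript
// Illustrative shape: only the three field names come from the text above.
interface Scenario {
  taskTypeId: string;
  groundTruth: Record<string, unknown>;
  correctOutcome: { verdict: string };
}

// Hypothetical stand-in: draw once, then hand each lane its own deep copy.
// A JSON round-trip is enough for plain-data scenarios like this one.
function dealToLanes(laneIds: string[], draw: () => Scenario): Map<string, Scenario> {
  const scenario = draw();
  return new Map(
    laneIds.map((id) => [id, JSON.parse(JSON.stringify(scenario)) as Scenario] as const),
  );
}

const lanes = dealToLanes(["a", "b", "c"], () => ({
  taskTypeId: "example-task", // made-up values for the demonstration
  groundTruth: { weightKg: 2 },
  correctOutcome: { verdict: "pass" },
}));

const laneA = lanes.get("a")!;
const laneB = lanes.get("b")!;

// Property 1: identical content across lanes.
const identical = JSON.stringify(laneA) === JSON.stringify(laneB);

// Property 2: mutating one lane's copy must not leak into another.
laneA.groundTruth.weightKg = 99;
const independent = laneB.groundTruth.weightKg === 2;

console.log(identical, independent); // true true
```

Checking independence by mutating one clone is the part that catches a shared reference, which an equality check alone would miss.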

Test conventions

  • No network. Real model calls are stubbed (mock clients, fake timers, a keyed-env stub for the Worker). The Vitest suite runs in ~1s.
  • Determinism where it matters. Math.random is stubbed via vi.spyOn for the weighted-draw and errorRate tests; statistical tests use wide tolerances.
  • Fake timers (vi.useFakeTimers({ shouldAdvanceTime: true })) drive the engine integration tests (loop.test.ts) so a 15s race completes in milliseconds.
  • Playwright e2e is a separate command (npm run test:e2e) because it boots the real Vite dev server and drives a real browser. It runs in mock mode so no keys are needed; selectors are scoped by role/title to stay robust.
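
To show why pinning Math.random makes a weighted draw testable, here is a minimal dependency-free sketch. The suite itself uses vi.spyOn for this; the withFixedRandom helper below does the equivalent swap-and-restore by hand so the example runs without Vitest. Both weightedDraw and withFixedRandom are hypothetical names, and the draw logic is an assumption about how such a function might work, not the arrival pump's code.

```typescript
// Hypothetical weighted draw: returns an index with probability proportional to its weight.
function weightedDraw(weights: number[], rng: () => number = Math.random): number {
  const total = weights.reduce((sum, w) => sum + w, 0);
  let threshold = rng() * total;
  for (let i = 0; i < weights.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return i;
  }
  return weights.length - 1; // fallback for floating-point drift
}

// Hand-rolled equivalent of stubbing Math.random: swap it, run, always restore.
function withFixedRandom<T>(value: number, fn: () => T): T {
  const original = Math.random;
  Math.random = () => value;
  try {
    return fn();
  } finally {
    Math.random = original;
  }
}

// Weights 1, 1, 2 split the unit interval at 0.25 and 0.5.
const weights = [1, 1, 2];
const low = withFixedRandom(0.1, () => weightedDraw(weights));
const mid = withFixedRandom(0.3, () => weightedDraw(weights));
const high = withFixedRandom(0.9, () => weightedDraw(weights));

console.log(low, mid, high); // 0 1 2
```

With the random source pinned, each bucket boundary can be asserted exactly; only tests about the overall distribution need the wide tolerances mentioned above.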

Type checking

npm run typecheck # tsc -b for the app, separate for the worker

Both the app (tsconfig.app.json) and Worker (worker/tsconfig.json) are strict. The build (npm run build) typechecks before bundling.