Authoring Scenario Data

Scenario data is separate from code — one JSON file per task type in data/scenarios/. This keeps content authoring decoupled from the build and lets non-engineers add cases.

The file convention

data/scenarios/<task-id>.json — a JSON array of scenarios. The loader (src/data/scenarios.ts) auto-discovers every *.json in that folder via import.meta.glob, so dropping in a new file is enough; no registry edit.
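
As a rough sketch of what that auto-discovery can look like: an eager import.meta.glob call in Vite yields an object keyed by file path, and the loader only has to flatten it into one list. The helper name flattenScenarioModules, the glob pattern in the comment, and the sourceFile field are assumptions for illustration, not the contents of src/data/scenarios.ts.

```typescript
// Hypothetical sketch; the real loader in src/data/scenarios.ts may differ.
type ScenarioLike = { id: string; taskTypeId: string };
type SourcedScenario = ScenarioLike & { sourceFile: string };

// `modules` has the shape returned by Vite's
//   import.meta.glob("/data/scenarios/*.json", { eager: true, import: "default" })
// i.e. a record from file path to that file's parsed JSON array.
function flattenScenarioModules(
  modules: Record<string, unknown>
): SourcedScenario[] {
  const all: SourcedScenario[] = [];
  // Sort paths so the resulting order is stable across builds.
  for (const path of Object.keys(modules).sort()) {
    const pool = modules[path];
    if (!Array.isArray(pool)) {
      throw new Error(`${path}: expected a JSON array of scenarios`);
    }
    for (const scenario of pool as ScenarioLike[]) {
      // Remember the source file so later error messages can point at it.
      all.push({ ...scenario, sourceFile: path });
    }
  }
  return all;
}

// Stand-in for the glob result, so the sketch runs outside Vite.
const discovered = flattenScenarioModules({
  "/data/scenarios/my-task.json": [{ id: "mt-1", taskTypeId: "my-task" }],
  "/data/scenarios/another-task.json": [
    { id: "at-1", taskTypeId: "another-task" },
  ],
});
```

Because the function takes the module record as a plain argument, it can be exercised in a unit test without a bundler; only the call site depends on Vite.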

Scenario shape

[
  {
    "id": "mt-1",
    "taskTypeId": "my-task",
    "difficulty": 2,
    "adversarial": false,
    "input": {
      "text": "Classify this item: ...",
      "imageUrl": "/data/assets/my-task/mt-1.png",
      "documents": ["--- optional doc ---\n..."]
    },
    "groundTruth": { "result": "allowed", "confidence": 0.9 },
    "correctOutcome": { "verdict": "accept", "pass": true, "summary": "allowed item" },
    "blurb": "one-line lobby description"
  }
]

| Field | Required | Notes |
| --- | --- | --- |
| id | yes | unique, e.g. mt-1 |
| taskTypeId | yes | must match a registered task |
| difficulty | yes | 1, 2, or 3 |
| input.text | one of | the prompt text the worker sees |
| input.imageUrl | for vision | asset path (served from /data/assets/) |
| input.documents | optional | attached doc text |
| groundTruth | yes | the authored answer; must validate against the task schema |
| correctOutcome | yes | the answer key (verdict / pass / summary) |
| adversarial | optional | marks highlight-reel cases; drawn ~22% of the time |
| blurb | optional | lobby explorer description |
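
The same envelope can be written down as TypeScript types, which some authors find easier to scan than the table. This is a sketch: the type names are assumptions, groundTruth is left loose because its shape comes from each task's schema, and hasPrimaryInput encodes one assumed reading of "one of" (a scenario needs text or an image).

```typescript
// Sketch of the scenario envelope; field names come from the table above,
// type names are hypothetical.
interface ScenarioInput {
  text?: string; // prompt text the worker sees
  imageUrl?: string; // asset path for vision tasks
  documents?: string[]; // attached document text
}

interface CorrectOutcome {
  verdict: string;
  pass: boolean;
  summary: string;
}

interface Scenario {
  id: string;
  taskTypeId: string;
  difficulty: 1 | 2 | 3;
  input: ScenarioInput;
  groundTruth: Record<string, unknown>; // real shape is defined per task schema
  correctOutcome: CorrectOutcome;
  adversarial?: boolean;
  blurb?: string;
}

// Assumed reading of "one of": at least text or imageUrl must be present.
function hasPrimaryInput(scenario: Scenario): boolean {
  return Boolean(scenario.input.text) || Boolean(scenario.input.imageUrl);
}

const example: Scenario = {
  id: "mt-1",
  taskTypeId: "my-task",
  difficulty: 2,
  input: { text: "Classify this item: ..." },
  groundTruth: { result: "allowed", confidence: 0.9 },
  correctOutcome: { verdict: "accept", pass: true, summary: "allowed item" },
};
```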

Truth-first discipline

Decide the ground-truth object first, then write the input to match it. The label is authored, not back-inferred from text the model might write. E.g. decide "crushed, severity 4, refuse," then write the matching parcel description / image prompt.

Correct outcomes

Every scenario must declare a correctOutcome. The verdict is the operational decision the agent should stamp; pass records whether resolving the scenario counts as clearing it (refused parcels are pass: false, because rejecting them isn't "clearing" them). You can derive these in bulk with the migration script:

npx tsx scripts/add-correct-outcome.ts

It reads each scenario's ground truth, computes the verdict deterministically via the shared deriveVerdict (src/orchestrator/verdict.ts), and stamps correctOutcome. Because this is the same function the engine stamps with, the answer key and the engine can't drift. The script is idempotent, so it is safe to re-run after authoring new scenarios.
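
To make the idempotence concrete, here is a minimal sketch of the stamping step. Everything in it is illustrative: deriveVerdictSketch stands in for the real deriveVerdict, and the parcel ground-truth shape and the severity threshold are invented, since the actual rules live in src/orchestrator/verdict.ts and are task-specific.

```typescript
// Hypothetical sketch; not the contents of scripts/add-correct-outcome.ts.
type ParcelTruth = { condition: string; severity: number };
type CorrectOutcome = { verdict: string; pass: boolean; summary: string };
type ParcelScenario = {
  id: string;
  groundTruth: ParcelTruth;
  correctOutcome?: CorrectOutcome;
};

// Pure function of the ground truth: same input, same verdict.
function deriveVerdictSketch(truth: ParcelTruth): CorrectOutcome {
  const refuse = truth.severity >= 4; // assumed threshold, for illustration
  return {
    verdict: refuse ? "refuse" : "accept",
    pass: !refuse, // refused parcels do not count as cleared
    summary: `${truth.condition}, severity ${truth.severity}`,
  };
}

// Recomputes correctOutcome from groundTruth and overwrites any existing
// value, so running it a second time changes nothing.
function stampCorrectOutcome(scenario: ParcelScenario): ParcelScenario {
  return {
    ...scenario,
    correctOutcome: deriveVerdictSketch(scenario.groundTruth),
  };
}

const once = stampCorrectOutcome({
  id: "p-1",
  groundTruth: { condition: "crushed", severity: 4 },
});
const twice = stampCorrectOutcome(once);
```

The key design point is that correctOutcome is always derived from groundTruth and never edited by hand, which is what lets the script be re-run freely.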

Validation is enforced

The loader Zod-validates both the envelope (incl. correctOutcome) and the groundTruth against the task's output schema. A malformed scenario fails the build with a precise error:

[scenarios] invalid scenario pool:
- data/scenarios/my-task.json → mt-3: groundTruth.confidence: expected number, received string

So you can't ship a broken pool by accident.
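
The error report above can be approximated with a small sketch. The real loader uses Zod; the hand-written check below only illustrates how per-scenario issues might be collected and rendered in that format. The Issue type and the function names are assumptions.

```typescript
// Sketch only: a hand-rolled stand-in for the loader's Zod validation.
type Issue = { file: string; id: string; path: string; message: string };

function describeType(value: unknown): string {
  if (value === null) return "null";
  return Array.isArray(value) ? "array" : typeof value;
}

// Hypothetical check for a task whose ground truth needs a numeric confidence.
function checkConfidence(
  file: string,
  scenario: { id: string; groundTruth: Record<string, unknown> }
): Issue[] {
  const value = scenario.groundTruth["confidence"];
  if (typeof value === "number") return [];
  return [
    {
      file,
      id: scenario.id,
      path: "groundTruth.confidence",
      message: `expected number, received ${describeType(value)}`,
    },
  ];
}

// Renders all issues as one message; a loader would throw it to fail the build.
function formatIssues(issues: Issue[]): string {
  const lines = issues.map(
    (i) => `- ${i.file} → ${i.id}: ${i.path}: ${i.message}`
  );
  return ["[scenarios] invalid scenario pool:", ...lines].join("\n");
}

const issues = checkConfidence("data/scenarios/my-task.json", {
  id: "mt-3",
  groundTruth: { result: "allowed", confidence: "0.9" },
});
const report = formatIssues(issues);
```

Collecting every issue before failing, rather than stopping at the first one, lets an author fix a whole pool in one pass.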

Generating scenarios at scale

For a large batch (the spec targets 8–15 per task type), use the generator:

export ANTHROPIC_API_KEY=sk-ant-...
npm run datagen -- <task-id> [count] # e.g. npm run datagen -- tariff-classification 6
npm run datagen -- --list # list valid task ids

scripts/generate.ts feeds the task's Zod schema to Claude (via the official Anthropic SDK, using structured output) as the contract and asks it to author scenarios ground-truth-first across difficulty tiers, with adversarial cases salted in. It then Zod-validates every candidate against the task schema, stamps correctOutcome, and merges the new scenarios into data/scenarios/<task-id>.json, skipping invalid candidates that would break the build. The generator writes only text/documents scenarios. Vision tasks still need an asset-production step, because the image must genuinely match the label: run npm run gen:assets and wire imageUrl by hand for those.
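
The merge step might look roughly like the sketch below. It is not taken from scripts/generate.ts: the function name, the isValid callback (a stand-in for the Zod checks), and the choice to skip candidates whose id already exists are all assumptions.

```typescript
// Hypothetical sketch of merging generated candidates into an existing pool.
type Candidate = { id: string; [key: string]: unknown };

function mergeScenarios(
  existing: Candidate[],
  candidates: Candidate[],
  isValid: (candidate: Candidate) => boolean // stand-in for schema validation
): { merged: Candidate[]; skipped: string[] } {
  const seen = new Set(existing.map((s) => s.id));
  const merged = [...existing];
  const skipped: string[] = [];
  for (const candidate of candidates) {
    // Drop anything that would fail validation or collide with an existing id.
    if (!isValid(candidate) || seen.has(candidate.id)) {
      skipped.push(candidate.id);
      continue;
    }
    seen.add(candidate.id);
    merged.push(candidate);
  }
  return { merged, skipped };
}

const result = mergeScenarios(
  [{ id: "mt-1", difficulty: 2 }],
  [
    { id: "mt-2", difficulty: 1 },
    { id: "mt-1", difficulty: 3 }, // duplicate id
    { id: "mt-3", difficulty: "hard" }, // invalid difficulty
  ],
  (c) => [1, 2, 3].includes(c["difficulty"] as number)
);
```

Keeping existing scenarios first and appending only validated candidates means a generator run can never remove or rewrite hand-authored cases.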