Agent eval lets you exercise your voice agents the way unit tests exercise code. A test “caller” (a persona LLM) talks to your agent (the production agent LLM); the conversation is scored against assertions you wrote up front. No phone numbers, no carrier minutes — just LLM tokens.
Use it to:
- Catch regressions before they ship — run the same suite of cases after every prompt change or flow edit and watch the pass/fail diff.
- Probe edge cases at scale — angry caller, mumbling caller, caller who switches language mid-call, caller who already knows the answer.
- Audit behaviour — prove to a customer or auditor that the agent hits required compliance phrases on every call.
- Fail PRs in CI — block a deploy when the score drops below threshold.
The four pieces
| Concept | Stored in | Purpose |
|---|---|---|
| Persona | eval_personas | Reusable caller archetype. One persona, many cases. |
| Case | eval_cases | A specific scenario: which persona, which agent, what scenario, what success looks like. |
| Suite | eval_suites | A bundle of cases run together as one batch. |
| Run | eval_runs (+ eval_run_turns) | The result of executing a case once: transcript, score, pass/fail, cost. |
The relationships:
```text
Persona ─┐
         ├─► Case ─► Run    (1 case → many runs over time)
Agent   ─┘     │
               └─► Suite    (a case can belong to one suite, or be ad-hoc)
```
A run is always against a specific case. To run a one-off check, create the case first. To re-run a regression sweep, run the suite that wraps the cases.
Designing a persona
Personas are reusable — write them carefully. A persona has:
- identity_prompt: written in the second person (“You are a 34-year-old tenant…”). Include the role, the situation, and what brought them to the call.
- behavior_traits: JSON knobs the persona LLM is told to honour, such as patience, verbosity, cooperation, interruption_tendency, and goal. The object is free-form; the persona LLM reads whatever keys you provide.
- language: he (Hebrew) or en (English). Set this to match the agent under test.
Tips:
- Keep identity_prompt under ~120 words. Long prompts make the persona LLM rigid and break the illusion.
- Put adjustable axes in behavior_traits, not in the prompt. That way you can have one identity (“frustrated tenant”) and several variants (“patient version”, “impatient version”, “screaming version”) without copy-pasting.
- Don’t pre-write the script for the persona. Describe their goal and let them improvise.
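The tips above can be sketched as a small script that keeps one identity and varies only the traits. This is an illustration, not the documented request format: the body shape, the `name` field, the trait values, and the `create_persona` helper are assumptions; only `identity_prompt`, `behavior_traits`, `language`, and the `/agent-eval/personas` path come from this page.

```shell
# Sketch only: the request body shape and the "name" field are assumptions.
IDENTITY_PROMPT="You are a 34-year-old tenant calling your property manager. Your heating has been broken for three days and two earlier calls went unanswered."

# Count whitespace-separated words so the ~120-word guideline can be checked.
word_count() {
  printf '%s\n' "$1" | wc -w | tr -d ' '
}

# Emit a persona payload. Traits are arguments, so one identity backs several variants.
# Note: values are interpolated as-is, so they must not contain double quotes.
persona_payload() {
  local name="$1" patience="$2" cooperation="$3"
  cat <<EOF
{
  "name": "$name",
  "identity_prompt": "$IDENTITY_PROMPT",
  "behavior_traits": {
    "patience": "$patience",
    "cooperation": "$cooperation",
    "goal": "get a repair visit scheduled this week"
  },
  "language": "en"
}
EOF
}

# Hypothetical helper (not invoked here): POST the payload to the persona endpoint.
create_persona() {
  persona_payload "$@" | curl -sf -X POST "https://api.goyappr.com/agent-eval/personas" \
    -H "Authorization: Bearer $YAPPR_API_KEY" \
    -H "Content-Type: application/json" \
    --data @-
}

if [ "$(word_count "$IDENTITY_PROMPT")" -gt 120 ]; then
  echo "identity_prompt is longer than ~120 words" >&2
fi

# Print the impatient variant; a patient variant would change only the trait arguments.
persona_payload "frustrated-tenant-impatient" "low" "medium"
```

Calling `persona_payload "frustrated-tenant-patient" "high" "high"` would produce the second variant without touching the identity text.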
Writing success criteria
Every case has an array of assertions evaluated after the run. Each assertion has a type, a weight, and type-specific fields. The case passes when weighted_score >= pass_threshold (default 80).
| Type | What it checks | Required fields |
|---|---|---|
| must_say | Agent’s transcript matches the pattern | pattern (regex, case-insensitive; plain substrings work) |
| must_not_say | Agent’s transcript does NOT match the pattern | same as must_say |
| must_call_tool | Agent invoked the named tool at least once | tool_name (e.g. bookAppointment); optional args_match for arg-shape checks |
| must_reach_node | Flow agent visited the named node | node_id (flow agents only; silently fails on prompt-mode agents) |
| custom_llm_judge | A judge LLM rates the run against a rubric | rubric (free-form natural language). Engine stub in v1; full LLM-as-judge ships in v1.1 |
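As a rough illustration of how these assertion types might sit together in one case, the sketch below prints a case body. The assertion fields (`type`, `weight`, `pattern`, `tool_name`, `rubric`) and the `pass_threshold` and `max_turns` names come from this page; the `assertions`, `persona_id`, `agent_id`, and `scenario` keys and the placeholder IDs are assumptions made for the example.

```shell
# Sketch only: top-level key names other than pass_threshold and max_turns are assumed.
case_payload() {
  cat <<'EOF'
{
  "persona_id": "PERSONA_ID",
  "agent_id": "AGENT_ID",
  "scenario": "Tenant calls to book a heating repair visit.",
  "pass_threshold": 80,
  "max_turns": 20,
  "assertions": [
    { "type": "must_say", "weight": 1, "pattern": "this call (is|may be) recorded" },
    { "type": "must_not_say", "weight": 1, "pattern": "cannot help" },
    { "type": "must_call_tool", "weight": 3, "tool_name": "bookAppointment" },
    { "type": "custom_llm_judge", "weight": 1, "rubric": "The agent stays polite even when the caller is rude." }
  ]
}
EOF
}

case_payload
```

The tool call carries the largest weight here because booking the visit is the point of the scenario; the phrase checks act as lighter guard rails.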
The case-level max_turns field (default 20) is a hard safety cap on the conversation, not a success criterion — if the conversation runs that long the run terminates with termination_reason: max_turns.
Weights default to 1.0. A case with three weight-1 assertions and one weight-3 assertion has a max score of 6; the heavy assertion is worth half the case on its own.
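To make the weighting concrete, here is a minimal sketch of the arithmetic. It assumes weighted_score is the passed weight divided by the total weight, expressed as a percentage so it can be compared with pass_threshold; the page does not spell out the formula, so treat `weighted_score` below as a hypothetical helper.

```shell
# Assumption: weighted_score = 100 * (sum of passed weights) / (sum of all weights).
# Each argument has the form "<weight>:<pass|fail>".
weighted_score() {
  printf '%s\n' "$@" | awk -F: '
    { total += $1; if ($2 == "pass") earned += $1 }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * earned / total
    }'
}

weighted_score 1:pass 1:pass 1:fail 3:pass   # one light assertion fails  -> 83.3
weighted_score 1:pass 1:pass 1:pass 3:fail   # the heavy assertion fails  -> 50.0
```

Under that assumption and the default threshold of 80, missing one weight-1 assertion still passes, while missing the weight-3 assertion fails the case on its own.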
Polling contract: status vs queue_status
Two lifecycle fields exist on every run; they update at different moments:
- status: what the customer-facing UI shows. Flips to completed / failed / cancelled the moment the conversation ends.
- queue_status: internal worker pipeline state (pending → claimed → done). Only after queue_status === "done" are the scoring + billing fields written (score, pass_fail, evaluation, total_cost_cents).
Always wait for queue_status === "done" before reading scoring or cost fields. The transient window between status === "completed" and queue_status === "done" is typically under 5 seconds but can be longer under contention. Reading score early returns null, not 0 — which can silently fail CI checks if you aren’t careful.
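One way to keep a null score from slipping through a CI check is a gate that treats "not ready" as its own outcome. The sketch below is an assumption-laden illustration: `score_gate` is a hypothetical helper, and it expects the caller to have already extracted queue_status and score from a run (for example with jq).

```shell
# Exit codes: 0 = score meets the threshold, 1 = score below threshold,
#             2 = scoring fields not written yet (keep polling).
score_gate() {
  local queue_status="$1" score="$2" threshold="$3"
  if [ "$queue_status" != "done" ] || [ -z "$score" ] || [ "$score" = "null" ]; then
    return 2
  fi
  # awk performs the numeric comparison so fractional scores work too.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }'
}

# Example: a run whose conversation ended but whose score is not written yet.
rc=0
score_gate "claimed" "null" 80 || rc=$?
echo "exit code: $rc"   # prints "exit code: 2"
```

Keeping exit code 2 separate from 1 lets the calling script retry instead of reporting a failed case.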
POST /agent-eval/cases/:id/run already blocks up to 60 seconds for short single-case runs and returns 200 only when fully done (queue_status === "done"). It returns 202 and the latest state if it times out. Suite-run is always async (202) — poll the aggregate.
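A caller of the single-case endpoint therefore has two outcomes to handle. The sketch below separates the HTTP status from the body using curl's `-o` and `-w '%{http_code}'` options; `run_case` and `classify_run_response` are hypothetical helpers, and nothing here is confirmed beyond the 200 and 202 behaviour described above.

```shell
# Returns 0 when the run is final (200), 2 when the caller must poll (202),
# and 1 for any other status code.
classify_run_response() {
  case "$1" in
    200) return 0 ;;
    202) return 2 ;;
    *)   return 1 ;;
  esac
}

# Hypothetical wrapper (not invoked here): writes the response body to a file
# and prints only the HTTP status code on stdout.
run_case() {
  local case_id="$1" body_file="$2"
  curl -s -o "$body_file" -w '%{http_code}' -X POST \
    "https://api.goyappr.com/agent-eval/cases/$case_id/run" \
    -H "Authorization: Bearer $YAPPR_API_KEY"
}

# Usage idea:
#   code=$(run_case "$CASE_ID" /tmp/run.json)
#   classify_run_response "$code"   # 0: read scores now, 2: poll until queue_status is "done"
for code in 200 202; do
  if classify_run_response "$code"; then
    echo "$code: final, scoring fields are safe to read"
  else
    echo "$code: not final, keep polling"
  fi
done
```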
CI integration
The simplest pattern: run the suite, poll until done, check pass rate. Eval runs do not emit webhooks today — polling is the only mechanism.
```shell
set -euo pipefail   # bash: stop on the first failed command or unset variable

SUITE_ID=...

# Start the suite run (always async); -f makes curl fail on HTTP errors
RUN=$(curl -sf -X POST "https://api.goyappr.com/agent-eval/suites/$SUITE_ID/run" \
  -H "Authorization: Bearer $YAPPR_API_KEY")
SUITE_RUN_ID=$(echo "$RUN" | jq -r .suite_run_id)

# Poll: wait for every run's queue_status to reach "done"
while true; do
  PENDING=$(curl -sf "https://api.goyappr.com/agent-eval/runs?suite_run_id=$SUITE_RUN_ID" \
    -H "Authorization: Bearer $YAPPR_API_KEY" \
    | jq '[.data[] | select(.queue_status != "done")] | length')
  if [ "$PENDING" = "0" ]; then break; fi
  sleep 5
done

# Score: safe to read scoring fields now (an empty run list counts as 0%)
PASS_RATE=$(curl -sf "https://api.goyappr.com/agent-eval/runs?suite_run_id=$SUITE_RUN_ID" \
  -H "Authorization: Bearer $YAPPR_API_KEY" \
  | jq '[.data[] | .pass_fail]
        | if length == 0 then 0 else (map(select(. == true)) | length) / length * 100 end')
[ "$(echo "$PASS_RATE >= 90" | bc)" = "1" ] || exit 1
```
Billing
Eval runs are charged against your normal credit balance. The current public rate card:
| Role | Input | Output |
|---|---|---|
| Agent | $2 / 1M tokens | $10 / 1M tokens |
| Persona | $1 / 1M tokens | $4 / 1M tokens |
These are user-facing prices (covering the AI cost plus margin), not provider costs. A typical 10-turn case lands between $0.005 and $0.05 depending on prompt length. A 50-case regression suite for under a dollar is normal.
The total_cost_cents field on each run is the exact amount debited. After-the-fact roll-ups are available via GET /billing/consumption?product=eval_run.
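For a back-of-the-envelope estimate before running a suite, the rate card can be applied to expected token counts. This is a sketch only: `estimate_cost_usd` is a hypothetical helper, the token counts are invented for illustration, and the amount actually debited is whatever total_cost_cents reports.

```shell
# Rates are USD per 1M tokens, so tokens * rate yields micro-dollars.
# Arguments: agent input, agent output, persona input, persona output (token counts).
estimate_cost_usd() {
  local agent_in="$1" agent_out="$2" persona_in="$3" persona_out="$4"
  awk -v ai="$agent_in" -v ao="$agent_out" -v pi="$persona_in" -v po="$persona_out" \
    'BEGIN { printf "%.4f\n", (ai * 2 + ao * 10 + pi * 1 + po * 4) / 1000000 }'
}

# Illustrative case: 8,000 agent input tokens, 600 agent output tokens,
# 4,000 persona input tokens, 400 persona output tokens.
estimate_cost_usd 8000 600 4000 400   # prints 0.0276
```

That illustrative case lands inside the $0.005 to $0.05 range quoted above; input tokens dominate because each turn resends the conversation so far.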
Failure modes worth understanding
- status: failed means the worker hit an unrecoverable error (model 5xx, malformed flow_config, persona LLM refused). The error field has the diagnostic. The run is partially billed for any turns produced before the error.
- status: completed + pass_fail: false means the run executed cleanly but didn’t meet the pass threshold. This is the case you want to investigate.
- termination_reason: max_turns means the conversation hit the case’s max_turns cap. Often a sign the persona is adversarial enough that the agent never reaches a closing node, or the case’s success criteria are unreachable.
When debugging, fetch the turns (GET /agent-eval/runs/:id/turns) and walk them top to bottom. For flow agents, the flow_event rows mirror the flow’s routing decisions and are usually where bugs hide.
API surface
| Method & path | Purpose |
|---|---|
| GET/POST/PATCH/DELETE /agent-eval/personas[/:id] | Persona CRUD |
| GET/POST/PATCH/DELETE /agent-eval/cases[/:id] | Case CRUD |
| GET/POST/PATCH/DELETE /agent-eval/suites[/:id] | Suite CRUD |
| POST /agent-eval/suites/:id/run | Run a whole suite |
| GET/POST /agent-eval/runs[/:id] | List runs, run a single case |
| GET /agent-eval/runs/:id/turns | Full transcript |
| GET /agent-eval/runs/:id/evaluation | Just the score + assertion results |
| POST /agent-eval/runs/:id/cancel | Best-effort cancel |
GET /billing/consumption?product=eval_run | Spend roll-up |