Agent Eval

Agent eval lets you exercise your voice agents the way unit tests exercise code. A test “caller” (a persona LLM) talks to your agent (the production agent LLM); the conversation is scored against assertions you wrote up front. No phone numbers, no carrier minutes — just LLM tokens. Use it to:

Catch regressions before they ship — run the same suite of cases after every prompt change or flow edit and watch the pass/fail diff.
Probe edge cases at scale — angry caller, mumbling caller, caller who switches language mid-call, caller who already knows the answer.
Audit behaviour — prove to a customer or auditor that the agent hits required compliance phrases on every call.
Fail PRs in CI — block a deploy when the score drops below threshold.

The four pieces

Concept	Stored in	Purpose
Persona	`eval_personas`	Reusable caller archetype. One persona, many cases.
Case	`eval_cases`	A specific scenario: which persona, which agent, what scenario, what success looks like.
Suite	`eval_suites`	A bundle of cases run together as one batch.
Run	`eval_runs` (+ `eval_run_turns`)	The result of executing a case once: transcript, score, pass/fail, cost.

The relationships:

Persona  ─┐
          ├─►  Case  ─►  Run  (1 case → many runs over time)
Agent    ─┘     │
                └─►  Suite (a case can belong to one suite, or be ad-hoc)

A run is always against a specific case. To run a one-off check, create the case first. To re-run a regression sweep, run the suite that wraps the cases.
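
The relationships above can be sketched as plain data types. This is an illustrative model only: the class and field names are assumptions drawn from the table, not the actual schema of `eval_personas`, `eval_cases`, or `eval_runs`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Persona:
    id: str
    identity_prompt: str


@dataclass
class Case:
    id: str
    persona_id: str                  # one persona, many cases
    agent_id: str
    suite_id: Optional[str] = None   # None models an ad-hoc case


@dataclass
class Run:
    id: str
    case_id: str                     # a run is always against a specific case


def runs_for_case(runs: List[Run], case_id: str) -> List[Run]:
    """Return every run recorded for one case (1 case -> many runs over time)."""
    return [r for r in runs if r.case_id == case_id]


persona = Persona(id="p1", identity_prompt="You are a 34-year-old tenant...")
case = Case(id="c1", persona_id=persona.id, agent_id="a1")
history = [Run("r1", "c1"), Run("r2", "c1"), Run("r3", "c2")]
print(len(runs_for_case(history, "c1")))  # 2
```

Modelling `suite_id` as optional mirrors the rule that a case either belongs to one suite or is ad-hoc.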

Designing a persona

Personas are reusable — write them carefully. A persona has:

identity_prompt — second-person (“You are a 34-year-old tenant…”). Include the role, the situation, what brought them to the call.
behavior_traits — JSON knobs the persona LLM is told to honour: patience, verbosity, cooperation, interruption_tendency, goal. The keys are free-form: the persona LLM reads whatever keys you provide.
language — he or en. Set this to match the agent under test.

Tips:

Keep identity_prompt under ~120 words. Long prompts make the persona LLM rigid and break the illusion.
Put adjustable axes in behavior_traits, not in the prompt. That way you can have one identity (“frustrated tenant”) and several variants (“patient version”, “impatient version”, “screaming version”) without copy-pasting.
Don’t pre-write the script for the persona. Describe their goal and let them improvise.
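
As a rough sketch of the "one identity, several variants" tip, the snippet below derives variants by overriding only behavior_traits. The scenario text, the trait values ("low", "medium", "high"), and the `make_variant` helper are hypothetical; the source does not specify what values a trait accepts.

```python
import copy

BASE_PERSONA = {
    "identity_prompt": (
        "You are a 34-year-old tenant calling about a heating problem "
        "that has not been fixed for two weeks."
    ),
    # Trait values are illustrative; the keys are free-form.
    "behavior_traits": {"patience": "medium", "verbosity": "medium", "goal": "get a repair date"},
    "language": "en",
}


def make_variant(base: dict, **trait_overrides) -> dict:
    """Copy a persona and override only its behavior_traits."""
    variant = copy.deepcopy(base)
    variant["behavior_traits"].update(trait_overrides)
    return variant


def prompt_word_count(persona: dict) -> int:
    """Word count of the identity prompt, to check it against the ~120-word guideline."""
    return len(persona["identity_prompt"].split())


patient = make_variant(BASE_PERSONA, patience="high")
impatient = make_variant(BASE_PERSONA, patience="low", interruption_tendency="high")

print(patient["identity_prompt"] == impatient["identity_prompt"])  # True
print(prompt_word_count(BASE_PERSONA) <= 120)                      # True
```

The deep copy keeps the base persona untouched, so each variant can be created as its own persona record without editing the shared identity.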

Writing success criteria

Every case has an array of assertions evaluated after the run. Each assertion has a type, a weight, and type-specific fields. The case passes when weighted_score >= pass_threshold (default 80).

Type	What it checks	Required fields
`must_say`	Agent’s transcript matches the pattern	`pattern` (regex, case-insensitive — plain substrings work)
`must_not_say`	Agent’s transcript does NOT match the pattern	same as `must_say`
`must_call_tool`	Agent invoked the named tool at least once	`tool_name` (e.g. `bookAppointment`); optional `args_match` for arg-shape checks
`must_reach_node`	Flow agent visited the named node	`node_id` (flow agents only — silently fails on prompt-mode agents)
`custom_llm_judge`	A judge LLM rates the run against a rubric	`rubric` (free-form natural language) — engine stub in v1; full LLM-as-judge ships in v1.1

The case-level max_turns field (default 20) is a hard safety cap on the conversation, not a success criterion. If the conversation runs that long, the run terminates with termination_reason: max_turns.

Weights default to 1.0. A case with three weight-1 assertions and one weight-3 assertion has a max score of 6; the heavy assertion is worth half the case on its own.
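
The weighting arithmetic can be sketched as follows. This is not the engine's implementation: it assumes weighted_score is the share of total weight earned, expressed on a 0-100 scale so it can be compared with pass_threshold, and the transcript and patterns are invented for illustration.

```python
import re
from typing import List, Tuple


def must_say(transcript: str, pattern: str) -> bool:
    """Case-insensitive regex search, so plain substrings also work."""
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None


def weighted_score(results: List[Tuple[float, bool]]) -> float:
    """Percentage of total weight earned by passing assertions (assumed 0-100 scale)."""
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    earned = sum(weight for weight, passed in results if passed)
    return earned / total * 100


transcript = "Thank you for calling. This call may be recorded."
results = [
    (1.0, must_say(transcript, "this call may be recorded")),
    (1.0, must_say(transcript, r"thank(s| you)")),
    (1.0, not must_say(transcript, "guarantee")),  # must_not_say
    (3.0, False),  # stands in for a failed weight-3 assertion, e.g. must_call_tool
]
score = weighted_score(results)
print(score, score >= 80)  # 50.0 False
```

All three weight-1 assertions pass, yet the case fails because the single weight-3 assertion carries half of the total weight.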

Polling contract: `status` vs `queue_status`

Two lifecycle fields exist on every run; they update at different moments:

status — what the customer-facing UI shows. Flips to completed / failed / cancelled the moment the conversation ends.
queue_status — internal worker pipeline state (pending → claimed → done). Only after queue_status === "done" are the scoring + billing fields written (score, pass_fail, evaluation, total_cost_cents).

Always wait for queue_status === "done" before reading scoring or cost fields. The transient window between status === "completed" and queue_status === "done" is typically under 5 seconds but can be longer under contention. Reading score early returns null, not 0, which can silently break CI checks if you aren’t careful.

POST /agent-eval/cases/:id/run already blocks up to 60 seconds for short single-case runs and returns 200 only when fully done (queue_status === "done"). It returns 202 and the latest state if it times out. Suite-run is always async (202), so poll the aggregate.
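
A minimal sketch of the polling rule, written against an injected `fetch_run` callable so it makes no assumptions about your HTTP client. The function name, the timeout default, and the sample run payloads are hypothetical; only the field names come from this page.

```python
import time
from typing import Callable


def wait_until_done(
    fetch_run: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
) -> dict:
    """Poll until queue_status is "done", then return the run.

    status alone is not enough: scoring and cost fields are written
    only once queue_status reaches "done".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        if run.get("queue_status") == "done":
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("run did not reach queue_status 'done' in time")
        time.sleep(interval_s)


# Fake fetcher: the conversation has ended, but scoring lands one poll later.
states = iter([
    {"status": "completed", "queue_status": "claimed", "score": None},
    {"status": "completed", "queue_status": "done", "score": 92, "pass_fail": True},
])
run = wait_until_done(lambda: next(states), interval_s=0)
print(run["score"])  # 92
```

Keying the loop on queue_status rather than status is what prevents a null score from being read during the transient window.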

CI integration

The simplest pattern: run the suite, poll until done, check pass rate. Eval runs do not emit webhooks today — polling is the only mechanism.

set -euo pipefail

SUITE_ID=...
RUN=$(curl -sf -X POST "https://api.goyappr.com/agent-eval/suites/$SUITE_ID/run" \
  -H "Authorization: Bearer $YAPPR_API_KEY")
SUITE_RUN_ID=$(echo "$RUN" | jq -r .suite_run_id)

# Poll: wait for every run's queue_status to reach "done".
# The 120-attempt cap (about 10 minutes) is an arbitrary safety limit; tune it for your suite.
PENDING=-1
for _ in $(seq 1 120); do
  PENDING=$(curl -sf "https://api.goyappr.com/agent-eval/runs?suite_run_id=$SUITE_RUN_ID" \
    -H "Authorization: Bearer $YAPPR_API_KEY" \
    | jq '[.data[] | select(.queue_status != "done")] | length')
  if [ "$PENDING" = "0" ]; then break; fi
  sleep 5
done
[ "$PENDING" = "0" ] || { echo "Timed out waiting for suite run $SUITE_RUN_ID" >&2; exit 1; }

# Score: safe to read scoring fields now. An empty run list counts as 0% so the gate fails closed.
PASS_RATE=$(curl -sf "https://api.goyappr.com/agent-eval/runs?suite_run_id=$SUITE_RUN_ID" \
  -H "Authorization: Bearer $YAPPR_API_KEY" \
  | jq '[.data[] | .pass_fail] | if length == 0 then 0 else (map(select(. == true)) | length) / length * 100 end')

# Block the deploy when fewer than 90% of runs passed
[ "$(echo "$PASS_RATE >= 90" | bc)" = "1" ] || exit 1
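
If your CI step is written in Python rather than shell, the same gate can be expressed as two small functions over the list of runs. This is a sketch under assumptions: `pass_rate` and `gate` are hypothetical helpers, the run dictionaries are hand-written stand-ins for the API response, and the 90% threshold simply mirrors the script above.

```python
from typing import List


def pass_rate(runs: List[dict]) -> float:
    """Percentage of runs with pass_fail == True; 0.0 for an empty list."""
    if not runs:
        return 0.0
    passed = sum(1 for run in runs if run.get("pass_fail") is True)
    return passed / len(runs) * 100


def gate(runs: List[dict], threshold: float = 90.0) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    if any(run.get("queue_status") != "done" for run in runs):
        return 1  # scoring fields are not written yet, so refuse to judge
    return 0 if pass_rate(runs) >= threshold else 1


runs = [
    {"queue_status": "done", "pass_fail": True},
    {"queue_status": "done", "pass_fail": True},
    {"queue_status": "done", "pass_fail": True},
    {"queue_status": "done", "pass_fail": False},
]
print(pass_rate(runs), gate(runs))  # 75.0 1
```

Treating a null pass_fail as "not passed" and an empty list as 0% makes the gate fail closed instead of passing by accident.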

Billing

Eval runs are charged against your normal credit balance. The current public rate card:

Role	Input	Output
Agent	$2 / 1M tokens	$10 / 1M tokens
Persona	$1 / 1M tokens	$4 / 1M tokens

These are user-facing prices (covering the AI cost plus margin), not provider costs. A typical 10-turn case lands between $0.005 and $0.05 depending on prompt length. A 50-case regression suite for under a dollar is normal. The total_cost_cents field on each run is the exact amount debited. After-the-fact roll-ups are available via GET /billing/consumption?product=eval_run.
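
To get a feel for the rate card, here is a back-of-the-envelope estimator. The token counts are made up for illustration and the helper is hypothetical; the total_cost_cents field on the run remains the authoritative figure.

```python
# Rate card from the table above, in dollars per 1M tokens.
RATES = {
    "agent": {"input": 2.0, "output": 10.0},
    "persona": {"input": 1.0, "output": 4.0},
}


def estimate_cost_usd(tokens: dict) -> float:
    """Sum tokens / 1M * rate over every role and direction."""
    total = 0.0
    for role, usage in tokens.items():
        for direction, count in usage.items():
            total += count / 1_000_000 * RATES[role][direction]
    return total


# Hypothetical token usage for one case.
usage = {
    "agent": {"input": 8_000, "output": 1_000},
    "persona": {"input": 4_000, "output": 500},
}
cost = estimate_cost_usd(usage)
print(f"${cost:.4f}")  # $0.0320
```

Here the agent side contributes $0.016 + $0.010 and the persona side $0.004 + $0.002, so agent tokens account for most of the spend in this example.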

Failure modes worth understanding

status: failed — the worker hit an unrecoverable error (model 5xx, malformed flow_config, persona LLM refused). The error field has the diagnostic. The run is partially billed for any turns produced before the error.
status: completed + pass_fail: false — the run executed cleanly but didn’t meet the pass threshold. This is the case you want to investigate.
termination_reason: max_turns — the conversation hit the case’s max_turns cap. Often a sign the persona is adversarial enough that the agent never reaches a closing node, or the case’s success criteria are unreachable.

When debugging, fetch the turns (GET /agent-eval/runs/:id/turns) and walk them top to bottom. For flow agents, the flow_event rows mirror the flow’s routing decisions and are usually where bugs hide.
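
The three failure modes can be told apart with a small triage helper. This is a sketch: `triage` is a hypothetical function, the returned messages are suggestions rather than API output, and the sample runs are hand-written using only the field names mentioned above.

```python
def triage(run: dict) -> str:
    """Map a finished run to the failure mode it most likely represents."""
    if run.get("status") == "failed":
        # Unrecoverable worker error; the error field carries the diagnostic.
        return f"worker error: {run.get('error')}"
    if run.get("termination_reason") == "max_turns":
        return "hit max_turns: check for an adversarial persona or unreachable criteria"
    if run.get("status") == "completed" and run.get("pass_fail") is False:
        return "below pass threshold: inspect the assertion results and the turns"
    return "ok"


print(triage({"status": "failed", "error": "model 5xx"}))
# worker error: model 5xx
print(triage({"status": "completed", "pass_fail": False, "termination_reason": "max_turns"}))
# hit max_turns: check for an adversarial persona or unreachable criteria
print(triage({"status": "completed", "pass_fail": True}))
# ok
```

The max_turns check runs before the pass/fail check because a capped conversation usually explains the failing score, so it is the more useful thing to surface first.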

API surface

Method & path	Purpose
`GET/POST/PATCH/DELETE /agent-eval/personas[/:id]`	Persona CRUD
`GET/POST/PATCH/DELETE /agent-eval/cases[/:id]`	Case CRUD
`GET/POST/PATCH/DELETE /agent-eval/suites[/:id]`	Suite CRUD
`POST /agent-eval/suites/:id/run`	Run a whole suite
`GET/POST /agent-eval/runs[/:id]`	List runs, run a single case
`GET /agent-eval/runs/:id/turns`	Full transcript
`GET /agent-eval/runs/:id/evaluation`	Just the score + assertion results
`POST /agent-eval/runs/:id/cancel`	Best-effort cancel
`GET /billing/consumption?product=eval_run`	Spend roll-up
