Agent eval lets you exercise your voice agents the way unit tests exercise code. A test “caller” (a persona LLM) talks to your agent (the production agent LLM); the conversation is scored against assertions you wrote up front. No phone numbers, no carrier minutes — just LLM tokens. Use it to:
  • Catch regressions before they ship — run the same suite of cases after every prompt change or flow edit and watch the pass/fail diff.
  • Probe edge cases at scale — angry caller, mumbling caller, caller who switches language mid-call, caller who already knows the answer.
  • Audit behaviour — prove to a customer or auditor that the agent hits required compliance phrases on every call.
  • Fail PRs in CI — block a deploy when the score drops below threshold.

The four pieces

| Concept | Stored in | Purpose |
|---|---|---|
| Persona | eval_personas | Reusable caller archetype. One persona, many cases. |
| Case | eval_cases | A specific scenario: which persona, which agent, what scenario, what success looks like. |
| Suite | eval_suites | A bundle of cases run together as one batch. |
| Run | eval_runs (+ eval_run_turns) | The result of executing a case once: transcript, score, pass/fail, cost. |
The relationships:
Persona  ─┐
          ├─►  Case  ─►  Run  (1 case → many runs over time)
Agent    ─┘     │
                └─►  Suite (a case can belong to one suite, or be ad-hoc)
A run is always against a specific case. To run a one-off check, create the case first. To re-run a regression sweep, run the suite that wraps the cases.

Designing a persona

Personas are reusable — write them carefully. A persona has:
  • identity_prompt: written in the second person (“You are a 34-year-old tenant…”). Include the role, the situation, and what brought them to the call.
  • behavior_traits: JSON knobs the persona LLM is told to honour, such as patience, verbosity, cooperation, interruption_tendency, goal. Free-form: the persona LLM reads whatever keys you provide.
  • language: he (Hebrew) or en (English). Set this to match the agent under test.
Tips:
  • Keep identity_prompt under ~120 words. Long prompts make the persona LLM rigid and break the illusion.
  • Put adjustable axes in behavior_traits, not in the prompt. That way you can have one identity (“frustrated tenant”) and several variants (“patient version”, “impatient version”, “screaming version”) without copy-pasting.
  • Don’t pre-write the script for the persona. Describe their goal and let them improvise.
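As a minimal sketch of the tip about keeping adjustable axes in behavior_traits, the shell function below builds two request bodies from one identity. The name field, the numeric 0 to 1 scale for patience, the Content-Type header, and the exact body accepted by POST /agent-eval/personas are assumptions for illustration; only identity_prompt, behavior_traits, and language come from the description above.

```shell
# Hypothetical helper: print a persona JSON body for one identity,
# varying only the behavior_traits. All values are illustrative.
persona_payload() {
  variant="$1"   # label for this variant, e.g. "patient"
  patience="$2"  # assumed 0..1 scale; the real scale is not documented here
  cat <<EOF
{
  "name": "tenant-heating-$variant",
  "identity_prompt": "You are a tenant whose heating has been broken for a week. You are calling the property manager to get a repair scheduled.",
  "behavior_traits": {
    "patience": $patience,
    "goal": "get a repair appointment booked"
  },
  "language": "en"
}
EOF
}

# Same identity, two variants; only the traits differ.
persona_payload patient 0.9
persona_payload impatient 0.2

# To create one (assumed request shape):
#   persona_payload patient 0.9 | curl -s -X POST \
#     "https://api.goyappr.com/agent-eval/personas" \
#     -H "Authorization: Bearer $YAPPR_API_KEY" \
#     -H "Content-Type: application/json" --data-binary @-
```

The identity text stays short and describes a goal rather than a script, so the persona LLM is left to improvise the wording.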

Writing success criteria

Every case has an array of assertions evaluated after the run. Each assertion has a type, a weight, and type-specific fields. The case passes when weighted_score >= pass_threshold (default 80).
| Type | What it checks | Required fields |
|---|---|---|
| must_say | Agent’s transcript matches the pattern | pattern (regex, case-insensitive; plain substrings work) |
| must_not_say | Agent’s transcript does NOT match the pattern | same as must_say |
| must_call_tool | Agent invoked the named tool at least once | tool_name (e.g. bookAppointment); optional args_match for arg-shape checks |
| must_reach_node | Flow agent visited the named node | node_id (flow agents only; silently fails on prompt-mode agents) |
| custom_llm_judge | A judge LLM rates the run against a rubric | rubric (free-form natural language). Engine stub in v1; full LLM-as-judge ships in v1.1 |
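An assertions array combining these types might look like the sketch below. The type names and required fields follow the table; the patterns and weights are made-up values, and the rest of the case body is left out because its other fields are not described here.

```shell
# Hypothetical helper: print an illustrative assertions array.
assertions_example() {
  cat <<'EOF'
[
  { "type": "must_say",       "weight": 1, "pattern": "this call may be recorded" },
  { "type": "must_not_say",   "weight": 1, "pattern": "guarantee(d)?" },
  { "type": "must_call_tool", "weight": 3, "tool_name": "bookAppointment" }
]
EOF
}

assertions_example
```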
The case-level max_turns field (default 20) is a hard safety cap on the conversation, not a success criterion: if the conversation runs that long, the run terminates with termination_reason: max_turns.
Weights default to 1.0. A case with three weight-1 assertions and one weight-3 assertion has a max score of 6; the heavy assertion is worth half the case on its own.
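The weight arithmetic can be checked with a few lines of awk. This sketch assumes weighted_score is the passed weight divided by the total weight, scaled to 0 to 100 so it can be compared with pass_threshold; the exact formula the engine uses is not spelled out above.

```shell
# Hypothetical helper: each argument is "weight:result", where result is
# 1 for a passed assertion and 0 for a failed one.
weighted_score() {
  printf '%s\n' "$@" | awk -F: '
    { total += $1; if ($2 == 1) earned += $1 }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", earned / total * 100
    }'
}

# Three weight-1 assertions pass, the weight-3 assertion fails: 3 of 6.
weighted_score 1:1 1:1 1:1 3:0   # prints 50.0

# The heavy assertion passes, one light one fails: 5 of 6.
weighted_score 1:1 1:1 1:0 3:1   # prints 83.3
```

Under this assumption and the default threshold of 80, the first case fails even though three of four assertions passed, while the second one passes.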

Polling contract: status vs queue_status

Two lifecycle fields exist on every run; they update at different moments:
  • status: what the customer-facing UI shows. Flips to completed / failed / cancelled the moment the conversation ends.
  • queue_status: internal worker pipeline state (pending → claimed → done). Only after queue_status === "done" are the scoring and billing fields written (score, pass_fail, evaluation, total_cost_cents).
Always wait for queue_status === "done" before reading scoring or cost fields. The transient window between status === "completed" and queue_status === "done" is typically under 5 seconds but can be longer under contention. Reading score early returns null, not 0, which can silently fail CI checks if you aren’t careful.
POST /agent-eval/cases/:id/run blocks for up to 60 seconds on short single-case runs and returns 200 only when the run is fully done (queue_status === "done"). If it times out, it returns 202 with the latest state. Suite runs are always async (202); poll the aggregate.
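A single-case run from a script can follow this contract roughly as sketched here. The id field on the run object, reading a single run from GET /agent-eval/runs/:id, and the 60-attempt cap are assumptions; the 200 and 202 handling mirrors the behaviour described above.

```shell
API="https://api.goyappr.com/agent-eval"

# Decide from the HTTP status whether scoring fields can be read yet.
# Returns 0 for 202 (still in flight), 1 for 200 (fully done), 2 otherwise.
needs_polling() {
  case "$1" in
    200) return 1 ;;
    202) return 0 ;;
    *)   echo "unexpected HTTP status: $1" >&2; return 2 ;;
  esac
}

# Run one case and print the finished run as JSON.
run_case() {
  case_id="$1"
  body=$(mktemp)
  code=$(curl -s -o "$body" -w '%{http_code}' -X POST \
    "$API/cases/$case_id/run" -H "Authorization: Bearer $YAPPR_API_KEY")
  rc=0; needs_polling "$code" || rc=$?
  if [ "$rc" -eq 2 ]; then rm -f "$body"; return 1; fi
  if [ "$rc" -eq 0 ]; then
    run_id=$(jq -r .id "$body")   # assumed field name for the run's id
    tries=0
    # Poll until the worker has written score, pass_fail and cost.
    until [ "$(jq -r .queue_status "$body")" = "done" ]; do
      tries=$((tries + 1))
      if [ "$tries" -gt 60 ]; then rm -f "$body"; return 1; fi
      sleep 2
      curl -s "$API/runs/$run_id" \
        -H "Authorization: Bearer $YAPPR_API_KEY" > "$body"
    done
  fi
  cat "$body"
  rm -f "$body"
}
```

A caller would then read the scoring fields only from the returned object, for example `run_case "$CASE_ID" | jq '{score, pass_fail, total_cost_cents}'`, where CASE_ID is a placeholder for an existing case.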

CI integration

The simplest pattern: run the suite, poll until done, check pass rate. Eval runs do not emit webhooks today — polling is the only mechanism.
set -euo pipefail

SUITE_ID=...
API="https://api.goyappr.com/agent-eval"
AUTH="Authorization: Bearer $YAPPR_API_KEY"

# Start the suite; -f makes curl exit non-zero on HTTP errors
RUN=$(curl -sf -X POST "$API/suites/$SUITE_ID/run" -H "$AUTH")
SUITE_RUN_ID=$(echo "$RUN" | jq -r .suite_run_id)

# Poll: wait for every run's queue_status to reach "done" (arbitrary 15-minute cap)
DEADLINE=$(( $(date +%s) + 900 ))
while true; do
  RUNS=$(curl -sf "$API/runs?suite_run_id=$SUITE_RUN_ID" -H "$AUTH")
  PENDING=$(echo "$RUNS" | jq '[.data[] | select(.queue_status != "done")] | length')
  if [ "$PENDING" = "0" ]; then break; fi
  if [ "$(date +%s)" -ge "$DEADLINE" ]; then echo "timed out waiting for suite run" >&2; exit 1; fi
  sleep 5
done

# Score: safe to read scoring fields now. A null pass_fail counts as not passed,
# and an empty result set scores 0 instead of dividing by zero.
PASS_RATE=$(echo "$RUNS" | jq '[.data[].pass_fail]
  | if length == 0 then 0 else (map(select(. == true)) | length) / length * 100 end')

# Fail the build when fewer than 90% of the cases passed
[ "$(echo "$PASS_RATE >= 90" | bc)" = "1" ] || exit 1

Billing

Eval runs are charged against your normal credit balance. The current public rate card:
| Role | Input | Output |
|---|---|---|
| Agent | $2 / 1M tokens | $10 / 1M tokens |
| Persona | $1 / 1M tokens | $4 / 1M tokens |
These are user-facing prices (covering the AI cost plus margin), not provider costs. A typical 10-turn case lands between $0.005 and $0.05 depending on prompt length. A 50-case regression suite for under a dollar is normal. The total_cost_cents field on each run is the exact amount debited. After-the-fact roll-ups are available via GET /billing/consumption?product=eval_run.
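The rate card turns into a back-of-the-envelope estimate as follows. The token counts in the example are invented for illustration, and the sketch ignores however the platform rounds when it writes total_cost_cents, so treat the result as an estimate rather than the billed amount.

```shell
# Estimate the cost of one run in cents from the public rate card.
# Args: agent input tokens, agent output tokens,
#       persona input tokens, persona output tokens.
estimate_cost_cents() {
  awk -v a_in="$1" -v a_out="$2" -v p_in="$3" -v p_out="$4" 'BEGIN {
    # Dollars per 1M tokens: agent 2 in / 10 out, persona 1 in / 4 out.
    # tokens * (dollars per 1M tokens) gives millionths of a dollar.
    micro_dollars = a_in * 2 + a_out * 10 + p_in * 1 + p_out * 4
    printf "%.2f\n", micro_dollars / 10000   # 10,000 micro-dollars = 1 cent
  }'
}

# Hypothetical 10-turn case: 8,000 / 800 agent tokens, 4,000 / 400 persona tokens.
estimate_cost_cents 8000 800 4000 400   # prints 2.96
```

At roughly 3 cents per case under these invented token counts, a 50-case suite would come to about $1.50, so prompt length is the main lever on suite cost.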

Failure modes worth understanding

  • status: failed — the worker hit an unrecoverable error (model 5xx, malformed flow_config, persona LLM refused). The error field has the diagnostic. The run is partially billed for any turns produced before the error.
  • status: completed + pass_fail: false — the run executed cleanly but didn’t meet the pass threshold. This is the case you want to investigate.
  • termination_reason: max_turns — the conversation hit the case’s max_turns cap. Often a sign the persona is adversarial enough that the agent never reaches a closing node, or the case’s success criteria are unreachable.
When debugging, fetch the turns (GET /agent-eval/runs/:id/turns) and walk them top to bottom. For flow agents, the flow_event rows mirror the flow’s routing decisions and are usually where bugs hide.
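For CI logs it can help to turn these cases into a one-line triage label. The sketch below takes the status, pass_fail, and termination_reason values of a run as plain arguments; the label strings are made up, and max_turns is checked first on the assumption that a capped conversation is worth flagging whatever its final status.

```shell
# Hypothetical helper: map a run's fields to a triage hint.
# Args: status, pass_fail ("true" / "false" / "null"), termination_reason.
triage_run() {
  status="$1"; pass_fail="$2"; reason="$3"
  if [ "$reason" = "max_turns" ]; then
    echo "hit-turn-cap: check the persona and whether the criteria are reachable"
  elif [ "$status" = "failed" ]; then
    echo "worker-error: read the run's error field"
  elif [ "$status" = "completed" ] && [ "$pass_fail" = "false" ]; then
    echo "below-threshold: walk the turns top to bottom"
  elif [ "$status" = "completed" ] && [ "$pass_fail" = "true" ]; then
    echo "ok"
  else
    echo "not-final: wait for queue_status to reach done"
  fi
}

triage_run completed false ""   # prints the below-threshold hint
triage_run failed null ""       # prints the worker-error hint
```

The final branch covers a run whose status is already completed but whose pass_fail is still null, which is the transient window before scoring is written.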

API surface

| Method & path | Purpose |
|---|---|
| GET/POST/PATCH/DELETE /agent-eval/personas[/:id] | Persona CRUD |
| GET/POST/PATCH/DELETE /agent-eval/cases[/:id] | Case CRUD |
| GET/POST/PATCH/DELETE /agent-eval/suites[/:id] | Suite CRUD |
| POST /agent-eval/suites/:id/run | Run a whole suite |
| GET/POST /agent-eval/runs[/:id] | List runs, run a single case |
| GET /agent-eval/runs/:id/turns | Full transcript |
| GET /agent-eval/runs/:id/evaluation | Just the score + assertion results |
| POST /agent-eval/runs/:id/cancel | Best-effort cancel |
| GET /billing/consumption?product=eval_run | Spend roll-up |