A voice bot QA checklist for 2026

    By Phil Smith
    6 min read

    A complete voice bot QA checklist for 2026 has twelve checks across three layers: conversational behaviour, telephony health, and regulatory compliance. Most teams have a partial version of layer one. Layers two and three are where production failures actually live, and they are the layers that legacy QA processes were never designed to cover.

    The full list is below. We use this internally on every voice bot we test, and it is the same shape as TotalPath's regression and compliance suites. Borrow it freely.

    Layer 1: conversational behaviour

    These are the checks most teams already do, even if not systematically.

    1. Greeting and intent capture. The bot answers within 1.5 seconds. The first turn is intelligible. The bot correctly captures the caller's stated intent and routes to the right path. Failure shape: greeting stretches under load; bot misroutes ambiguous intents.
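
    To make check 1 concrete, here is a minimal sketch of an intent-capture regression table in Python. The classify() stub and the intent labels are placeholders; in a real harness each case would play the utterance into a live call and read back the route the bot actually chose.

        def classify(utterance: str) -> str:
            """Stub router. Replace with the route observed on a real test call."""
            return "cancel_subscription" if "cancel" in utterance.lower() else "billing"

        # Each case pairs a caller utterance with the route the bot should take.
        CASES = [
            ("I want to cancel my subscription", "cancel_subscription"),
            ("There's a problem with my bill", "billing"),
        ]

        for utterance, expected in CASES:
            got = classify(utterance)
            verdict = "PASS" if got == expected else f"FAIL (routed to {got})"
            print(f"{utterance!r}: {verdict}")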

    2. Authentication and identity. If the bot authenticates callers, it does so without leaking information about valid vs invalid identifiers. It enforces lockouts. It does not read back PII to an unauthenticated caller. Failure shape: bot confirms a customer exists by saying "I can't verify that, please try again with a different account number."
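
    The enumeration half of check 2 reduces to a simple comparison: the bot's failure message must be indistinguishable whether the identifier exists or not. A sketch, with both replies stubbed in from two hypothetical test calls:

        import re

        def normalize(reply: str) -> str:
            # Strip digits and collapse whitespace so only the message shape remains.
            return re.sub(r"[\d\s]+", " ", reply.lower()).strip()

        # Stubbed replies: one call used a known-valid account number, the
        # other a known-invalid one. Both should fail with the same message.
        reply_valid_id = "I couldn't verify that. Please try again."
        reply_invalid_id = "I couldn't verify that. Please try again."

        if normalize(reply_valid_id) != normalize(reply_invalid_id):
            print("FAIL: failure messages differ, callers can enumerate accounts")
        else:
            print("PASS: failure messages are indistinguishable")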

    3. Happy-path completion. The bot completes the canonical journey end-to-end without human intervention. The success state is reached. Any backend writes (CRM updates, payments, bookings) actually persist. Failure shape: bot says "I've booked that for you" when the booking API returned a 500.
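
    Check 3's trap is trusting the bot's own claim of success. A sketch of the persistence half, with the transcript and the backend lookup both stubbed; in a real suite the lookup would hit your actual CRM or booking API.

        def bot_claimed_success(bot_turns: list[str]) -> bool:
            return any("i've booked that for you" in t.lower() for t in bot_turns)

        def booking_exists(backend: dict, reference: str) -> bool:
            return reference in backend

        bot_turns = ["Let me check availability.",
                     "I've booked that for you. Your reference is ABC123."]
        backend = {}  # simulates the booking API having returned a 500

        if bot_claimed_success(bot_turns) and not booking_exists(backend, "ABC123"):
            print("FAIL: bot claimed success but nothing persisted in the backend")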

    4. Recovery from misrecognition. When ASR returns a low-confidence result or the caller misspeaks, the bot recovers gracefully — confirmation, clarification, or escalation — rather than re-asking the same question repeatedly. Failure shape: three "sorry, I didn't catch that" turns in a row before the call dies in the loop.
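
    Check 4 is easy to score mechanically once you have the transcript. A sketch of a reprompt-loop detector; the reprompt phrases are illustrative and should match your bot's actual wording.

        REPROMPTS = ("sorry, i didn't catch that", "could you repeat that")

        def max_consecutive_reprompts(bot_turns: list[str]) -> int:
            longest = run = 0
            for turn in bot_turns:
                if any(p in turn.lower() for p in REPROMPTS):
                    run += 1
                    longest = max(longest, run)
                else:
                    run = 0
            return longest

        turns = ["Sorry, I didn't catch that.", "Sorry, I didn't catch that.",
                 "Sorry, I didn't catch that.", "Goodbye."]
        if max_consecutive_reprompts(turns) >= 3:
            print("FAIL: three consecutive reprompts with no recovery or escalation")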

    5. Handoff to a human. When a caller asks for a person, or when the bot determines escalation is warranted, the handoff completes. Context is preserved. The human picks up with the relevant information already on screen, not blank. Failure shape: handoff drops the call entirely; or the human picks up cold and the customer has to start again.
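
    The context-preservation half of check 5 is a set comparison: every field the bot captured should arrive in the payload the agent desktop receives. The field names here are illustrative.

        captured = {"intent": "cancel_subscription",
                    "account_id": "123456",
                    "auth_status": "verified"}
        delivered = {"intent": "cancel_subscription"}  # what the agent actually sees

        missing = sorted(set(captured) - set(delivered))
        if missing:
            print(f"FAIL: handoff dropped context fields: {missing}")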

    Layer 2: telephony health

    These are the checks that synthetic API testing never catches.

    6. Time-to-greet. Wall-clock time between call answer and first audible bot prompt. Healthy is under 1.5 seconds. Above 3 seconds you are losing customers. Failure shape: TTS cold-start; queue depth growing under load; interim silence misinterpreted by the customer as a dead line.

    7. End-to-end turn latency. Time from end of caller speech to start of bot response. Healthy is 600–1500 ms. Above 3 seconds the bot has lost the rhythm of natural conversation. Failure shape: callers talk over the bot; bot misses turns; conversation collapses.
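
    Both latency checks fall out of the same timestamped event log. A sketch using the thresholds from checks 6 and 7; the event format is an assumption about what your call recorder emits.

        # Timestamped events from one probe call (seconds from call start).
        events = [
            ("call_answered",     0.00),
            ("bot_audio_start",   1.20),   # first prompt: time-to-greet
            ("caller_speech_end", 6.50),
            ("bot_audio_start",   7.40),   # reply: turn latency
        ]

        def first_after(name: str, t: float) -> float:
            return next(ts for n, ts in events if n == name and ts > t)

        answered = next(ts for n, ts in events if n == "call_answered")
        ttg = first_after("bot_audio_start", answered) - answered
        print(f"time-to-greet: {ttg:.2f}s ({'OK' if ttg < 1.5 else 'SLOW'})")

        speech_end = next(ts for n, ts in events if n == "caller_speech_end")
        latency = first_after("bot_audio_start", speech_end) - speech_end
        print(f"turn latency: {latency*1000:.0f}ms "
              f"({'OK' if 0.6 <= latency <= 1.5 else 'OUT OF RANGE'})")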

    8. ASR confidence under load. Speech recognition confidence stays within range as concurrency increases. Failure shape: ASR confidence silently degrades at peak; bot disambiguates more often; customers repeat themselves; complaints about "the system is bad today."
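
    A sketch of check 8 as a concurrency sweep. A real run places N simultaneous calls per step and pulls confidence scores from the ASR results; the per-step scores here are stubbed to show the degradation pattern you are looking for.

        from statistics import mean

        confidence_by_concurrency = {
            1:  [0.94, 0.92, 0.95],
            25: [0.93, 0.91, 0.92],
            50: [0.78, 0.74, 0.80],   # the silent degradation at peak
        }

        baseline = mean(confidence_by_concurrency[1])
        for n, scores in sorted(confidence_by_concurrency.items()):
            drop = baseline - mean(scores)
            flag = "DEGRADED" if drop > 0.05 else "ok"
            print(f"{n:>3} concurrent calls: mean confidence {mean(scores):.2f} [{flag}]")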

    9. Audio quality and dropouts. No clipping, no excessive background hiss, no truncated prompts, no template variables leaking into spoken output. We have heard a production bot say "two end underscore call" on every turn for sixty seconds. Failure shape: TTS prompt cut off mid-sentence; unresolved templates; codec mismatch on certain carriers.
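
    The template-leak half of check 9 is a transcript scan. The patterns below cover two common template syntaxes plus the spoken-underscore case from the anecdote above; extend them to whatever your prompt layer uses.

        import re

        LEAK_PATTERNS = [
            r"\{\{.*?\}\}",              # unresolved {{variable}}
            r"\$\{.*?\}",                # unresolved ${variable}
            r"\b\w+ underscore \w+\b",   # a variable name read aloud by TTS
        ]

        def template_leaks(transcript: str) -> list[str]:
            return [m for p in LEAK_PATTERNS
                      for m in re.findall(p, transcript, re.IGNORECASE)]

        print(template_leaks("Your reference is two end underscore call, thanks."))
        # -> ['end underscore call']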

    10. PSTN reachability. The number is reachable from the carriers your customers actually use, in the regions you actually serve. Failure shape: the number routes from one mobile network and not another; international callers get carrier-blocked; toll-free works from landlines but not VoIP.
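
    Check 10 is a matrix, not a single test. A sketch of the shape; the probe() stub stands in for originating a real call from that carrier and region and waiting for answer.

        ORIGINS = [("UK", "Vodafone"), ("UK", "EE"), ("DE", "O2"), ("US", "VoIP")]

        def probe(region: str, carrier: str) -> bool:
            """Stub. Replace with a real originated call from this origin."""
            return not (region == "DE" and carrier == "O2")  # simulated block

        for region, carrier in ORIGINS:
            status = "reachable" if probe(region, carrier) else "UNREACHABLE"
            print(f"{region}/{carrier}: {status}")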

    Layer 3: regulatory compliance

    These are the checks that audit committees ask about and that nobody is continuously running.

    11. AI disclosure. If the caller is talking to an AI, the bot says so unambiguously, in its first turn, in language a reasonable caller would understand. (EU AI Act Article 50.) Pass condition: explicit AI self-identification before the bot asks for any information. Failure shape: "Hi, this is Sarah, how can I help?" with no AI clarification anywhere in the call. Also: the bot only discloses AI status when directly asked.

    12. Recording and data-use disclosure. The bot discloses recording status accurately and signposts the privacy notice. Marketing opt-outs are honoured if the caller has previously opted out. Vulnerable callers are routed appropriately and not subjected to manufactured urgency. Failure shape: no recording disclosure; persistent retention without consent; the bot reads marketing scripts to a caller on the no-call list.
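
    Both disclosure checks reduce to an ordering test over the bot's turns: the disclosure must appear before the first request for caller information. A sketch; the phrase lists are illustrative and should match your approved wording.

        bot_turns = [
            "Hi, you're speaking with an automated assistant. This call is recorded.",
            "Can I take your account number?",
        ]

        def first_index(turns: list[str], phrases: list[str]) -> int | None:
            for i, t in enumerate(turns):
                if any(p in t.lower() for p in phrases):
                    return i
            return None

        ai_at  = first_index(bot_turns, ["automated assistant", "virtual agent"])
        rec_at = first_index(bot_turns, ["call is recorded", "call may be recorded"])
        ask_at = first_index(bot_turns, ["account number", "date of birth", "postcode"])

        for name, at in (("AI disclosure", ai_at), ("recording disclosure", rec_at)):
            ok = at is not None and (ask_at is None or at < ask_at)
            print(f"{name}: {'PASS' if ok else 'FAIL'}")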

    These are the headline regulatory checks. Depending on your sector, the full list is longer: EU AI Act, FCA Consumer Duty PRIN 2A, FG21/1 (vulnerable callers), Ofcom General Conditions, UK Equality Act, PECR, GDPR, EU Unfair Commercial Practices Directive, European Accessibility Act, UK DMCC Act, FCA DISP. Eleven overlapping regimes before you count sector specifics.

    We covered this in detail in the compliance audit post. The headline finding: most teams do not continuously verify any of this.

    How to actually run the checklist

    A few practical notes from running this against real voice bots.

    Layer 1 belongs in CI. A regression suite covering the five conversational checks should run on every prompt or flow change. If your test tooling cannot dial a phone, you are not actually running this layer — you are testing the function-call wiring and hoping the rest works.

    Layer 2 belongs in scheduled monitoring plus pre-release load tests. Time-to-greet and turn latency drift quietly. They should be measured on a schedule (we recommend hourly during business hours) so the drift is visible before customers find it. Concurrency-related signals should be measured with a real-call load test before any release that touches the bot's hot path.
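
    A sketch of the scheduling shape for layer 2. In production this would be cron or your scheduler of choice rather than a sleep loop, and probe_latency() would place a real call and push both metrics to your monitoring store.

        import time
        from datetime import datetime

        def probe_latency() -> None:
            print(f"{datetime.now():%H:%M} placing probe call...")  # stub

        while True:
            now = datetime.now()
            if now.weekday() < 5 and 9 <= now.hour < 18:  # Mon-Fri business hours
                probe_latency()
            time.sleep(3600)  # hourly; small drift is fine for monitoring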

    Layer 3 belongs in compliance attestation runs. These don't need to run hourly. They need to run on a frequency you can defend to a regulator — monthly is a reasonable floor for a regulated industry — with dated, evidenced reports.
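
    For layer 3, the artefact matters as much as the run. A sketch of a dated, evidenced report; the structure is an assumption, but the principle is that every verdict carries the transcript evidence an auditor can re-read.

        import json
        from datetime import date

        results = [
            {"check": "AI disclosure", "verdict": "pass",
             "evidence": "Turn 1: 'you're speaking with an automated assistant'"},
            {"check": "Recording disclosure", "verdict": "pass",
             "evidence": "Turn 1: 'this call is recorded'"},
        ]

        report = {"run_date": date.today().isoformat(), "results": results}
        with open(f"attestation-{report['run_date']}.json", "w") as fh:
            json.dump(report, fh, indent=2)
        print(f"wrote attestation-{report['run_date']}.json")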

    What changes when you run the full checklist continuously

    Three things, in our experience.

    The first is that the conversation around releases changes shape. You stop batching up fortnightly drops because you're scared of regressions. You ship the small change you wanted to ship, run the suite, see green, and move on.

    The second is that production incidents collapse to the layers nobody was checking. Almost every voice bot incident we have done a postmortem on traces back to layer 2 or layer 3 — a TTS provider degrading silently, an ASR pipeline overloading, a disclosure regression nobody spotted because the prompt change passed unit tests.

    The third is that audits get easier. There is now a place where the tests are written down, a place where the pass conditions are written down, a place where the evidence is quoted back. The auditor on the other side of the table can read the same transcript you did and check your working.


    If you'd like a hand setting up the full checklist against a number you operate, we are happy to scope a pilot.

    Or register for a free account and start running layer-1 tests today.

