How to load test an agentic IVR (and why most teams are doing it wrong)
To load test an agentic IVR you simulate concurrent real phone calls — not synthetic SIP traffic — at the volume you expect to handle, while measuring four things: time-to-greet, ASR confidence per turn, end-to-end latency per turn, and whether the bot completes the journey or degrades. Synthetic packet-level load testing misses everything that matters in an agentic flow.
The short version of the playbook is four steps: pick one journey, ramp from 1 to N concurrent real calls, watch four signals at each level, find the breakpoint before your customers do.
The longer version follows.
Why packet-level load testing fails for agentic flows
Traditional IVR load testing came from the deterministic world. You knew the menu tree, you knew the prompts, you knew the expected DTMF responses. A SIP load generator could produce thousands of synthetic INVITE flows and validate that the IVR's call control held up. As long as the trunk side stayed responsive, you were good.
Agentic IVRs broke that model. The bot is a non-deterministic LLM with an ASR front-end and a TTS back-end. When you put 200 concurrent real callers on it, the failure modes are not at the SIP layer. They are:
- Time-to-greet stretches. Your TTS provider's queue depth grows. The first prompt now arrives 4 seconds after pickup instead of 800 ms. Real customers either repeat themselves or hang up.
- ASR confidence drops. The speech recognition pipeline shares GPU capacity across calls. Under load, confidence scores decay. The bot starts asking "sorry, I didn't catch that" to perfectly clear customers.
- End-to-end turn latency stretches. The bot's turn loop — ASR → LLM → tool call → TTS — was tuned at one concurrent call. At fifty it is hitting rate limits on three different vendors.
- Journey completion rate falls off a cliff. Not gradually. There is usually a knee in the curve where the bot stops finishing tasks because something in the loop is timing out and the LLM is now talking to an empty CRM response.
A SIP load generator sees none of this. You need to be making real calls and measuring what real customers experience.
Step 1: pick one journey
Pick the highest-value journey on the bot. Authentication and balance check, or appointment booking, or whatever pays the rent. Not the sprawling "everything the bot can do" suite. One journey, end-to-end, with a known correct outcome.
The reason is simple. Load testing surfaces emergent behaviour that doesn't show up in single-call testing. You want to be able to answer a specific question: "at N concurrent calls, does the customer still successfully complete this journey?" Diluting the test across twenty journeys hides the answer.
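Here is roughly what "one journey with a known correct outcome" looks like as test configuration. This is a minimal sketch; the Journey and Turn structures and the success marker are illustrative, not any particular runner's API:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    caller_says: str        # what the test caller says on this turn
    expect_in_reply: str    # phrase we expect somewhere in the bot's reply

@dataclass
class Journey:
    name: str
    turns: list[Turn] = field(default_factory=list)
    success_marker: str = ""   # phrase that marks the journey as completed

# One high-value journey, end-to-end, with a known correct outcome.
balance_check = Journey(
    name="authenticate-and-check-balance",
    turns=[
        Turn("Hi, I'd like to check my balance.", "date of birth"),
        Turn("The fourth of March, nineteen eighty-two.", "account ending"),
        Turn("Yes, that's the one.", "your balance is"),
    ],
    success_marker="your balance is",
)
```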
Step 2: ramp from 1 to N concurrent real calls
Start at one. Validate the test caller behaves correctly and the journey completes. This sounds trivial but it is the single most common reason load tests give bad data — the test was already broken at one call and the team only discovered it at fifty.
From one, ramp. Doubling is fine. 1, 2, 4, 8, 16, 32, 64, 128. At each rung, hold for at least 30 seconds of steady-state. The first 10 seconds are the bot warming up and you don't want to measure that.
The N you are looking for is your expected peak plus 50%. If your busy-hour peak is 80 concurrent voice sessions, you want to be confident at 120. If you don't know your peak, your CCaaS reporting does — go look.
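If you want the ramp written down, a sketch like the following captures it: doubling rungs capped at expected peak plus 50%, a steady-state hold per rung, and the warm-up excluded from measurement. The function and constant names are assumptions, not part of any runner:

```python
def ramp_levels(expected_peak: int) -> list[int]:
    """Doubling rungs (1, 2, 4, 8, ...) capped at expected peak plus 50%."""
    target = int(expected_peak * 1.5)
    levels, n = [], 1
    while n < target:
        levels.append(n)
        n *= 2
    levels.append(target)   # finish at the level you actually need confidence in
    return levels

HOLD_SECONDS = 30     # steady-state window per rung
WARMUP_SECONDS = 10   # bot warm-up; excluded from measurement

# Busy-hour peak of 80 concurrent sessions -> test up to 120.
for level in ramp_levels(80):   # [1, 2, 4, 8, 16, 32, 64, 120]
    print(f"{level} concurrent calls: discard first {WARMUP_SECONDS}s, "
          f"measure the next {HOLD_SECONDS}s")
```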
Step 3: measure four signals at each level
These are the only signals that reliably tell you whether the bot is degrading.
Time-to-greet is the wall-clock time between call answer and the bot's first audible prompt. Healthy is under 1.5 seconds. Above 3 seconds you are losing customers. If this stretches under load, your TTS provider's queue or your bot's cold-start logic is the bottleneck.
ASR confidence per turn is the speech recognition system's reported confidence on each user utterance. If your provider exposes it (most do — Deepgram, Google, Azure, AssemblyAI), record it. If they don't, infer it from disambiguation rate (how often the bot says "sorry, I didn't catch that"). Under load this degrades silently and customers are forced to repeat themselves.
End-to-end latency per turn is the time from end-of-user-utterance to start-of-bot-utterance. Healthy is 600–1500 ms depending on whether the bot is doing tool calls. Above 3 seconds the bot has lost the rhythm of natural conversation and customers will start talking over it.
Journey completion rate is the percentage of test calls that reach the success state of the journey. At low concurrency this should be near 100% (assuming your test caller is well-designed). The level at which this falls off a cliff is your bot's real capacity ceiling, regardless of what the SIP-level dashboard says.
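If you want these as numbers per rung rather than a gut feel, a reduction like the one below works. The field names and percentile choices are assumptions about what your runner records, not a fixed schema:

```python
from statistics import median

def summarise_rung(calls: list[dict]) -> dict:
    """Reduce one rung's raw call records to the four signals.

    Each record is assumed to carry: time_to_greet_ms, asr_confidences
    (one value per turn), turn_latencies_ms (one value per turn), and completed.
    """
    latencies = sorted(x for c in calls for x in c["turn_latencies_ms"])
    confidences = [x for c in calls for x in c["asr_confidences"]]
    return {
        "p50_time_to_greet_ms": median(c["time_to_greet_ms"] for c in calls),
        "p50_asr_confidence": median(confidences),
        "p95_turn_latency_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "completion_rate": sum(c["completed"] for c in calls) / len(calls),
    }

# Example: two calls from one rung.
rung = [
    {"time_to_greet_ms": 900, "asr_confidences": [0.94, 0.91],
     "turn_latencies_ms": [1100, 1300], "completed": True},
    {"time_to_greet_ms": 1200, "asr_confidences": [0.88, 0.90],
     "turn_latencies_ms": [1400, 2900], "completed": False},
]
print(summarise_rung(rung))
```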
Step 4: find the breakpoint
The breakpoint is the concurrency level at which any of the four signals exits its acceptable range. Not when SIP fails. Not when the bot crashes. When customers would notice.
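In code, the rule is as simple as it sounds. The thresholds below use the healthy ranges from Step 3 where this post gives them, and plain assumptions where it doesn't (the ASR confidence floor, the completion-rate floor); set your own:

```python
# Acceptable ranges per signal (assumed thresholds; tune them to your own journey).
LIMITS = {
    "p50_time_to_greet_ms": lambda v: v <= 1500,
    "p50_asr_confidence":   lambda v: v >= 0.85,   # assumption; use your ASR's baseline
    "p95_turn_latency_ms":  lambda v: v <= 3000,
    "completion_rate":      lambda v: v >= 0.95,   # assumption; near-100% at low concurrency
}

def find_breakpoint(results: dict[int, dict]) -> int | None:
    """results maps concurrency level -> four-signal summary for that rung.
    Returns the lowest level where any signal exits its acceptable range."""
    for level in sorted(results):
        summary = results[level]
        if any(not ok(summary[name]) for name, ok in LIMITS.items()):
            return level
    return None   # no breakpoint found up to the highest level tested
```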
Once you have the breakpoint, you have three options.
The first is to provision capacity for it. Most TTS, ASR and LLM providers will let you reserve dedicated throughput at a higher cost. For a bot that handles known peak loads (retail, ticketing) this is usually the right answer.
The second is to adjust the bot's behaviour at high concurrency. Drop optional steps, take fewer LLM hops, fall back to a simpler model. We have customers who serve a fast-path version of the bot above N concurrent calls and a richer version below.
The third is to fail safely. Route to a hold queue with a recorded message. Don't let the bot take on a load it can't handle and fail in front of the customer.
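For the second and third options, the gating logic can be as small as this. The mode names and thresholds are placeholders; the real numbers come out of your own load test:

```python
def choose_mode(active_calls: int, fast_path_at: int = 60, breakpoint_at: int = 110) -> str:
    """Pick the bot's serving mode from current concurrency.

    fast_path_at and breakpoint_at should come from your load test results;
    the defaults here are placeholders.
    """
    if active_calls >= breakpoint_at:
        return "hold_queue"     # fail safely: recorded message and queue, not a degraded bot
    if active_calls >= fast_path_at:
        return "fast_path"      # fewer LLM hops, optional steps dropped, simpler model
    return "full_journey"       # the richer version of the bot

for n in (20, 75, 130):
    print(n, "->", choose_mode(n))
```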
Whatever you choose, do it before peak hits, not during.
What TotalPath does for this
We built TotalPath's load testing specifically because the synthetic-SIP approach was failing on agentic flows. Our load runner places real calls from real PSTN endpoints, runs the same conversational test caller across all of them, and gives you the four signals above on a single dashboard.
Pricing is per-second on the calls themselves. A 30-second journey at 100 concurrent calls is about 3,000 call-seconds, roughly $3.50 of load-testing cost on the free tier's PAYG rate. You can run a meaningful capacity validation for less than a takeaway.
What this means for the way you ship
Once you have a real load test that reflects real customer experience, the conversation around capacity changes. You stop guessing. You stop asking the vendor what their headline TPS is. You measure your own bot, on your own stack, and you find the breakpoint while it's cheap to fix.
Most of the agentic voice bot incidents we have seen in the last year would have been caught by a 20-minute load test the day before launch. None of them were.
If you operate an agentic voice bot and you have not run real-call load testing against it, you do not know your real capacity. We would love to help you find out.
Or register for a free account — you get 75 free load-testing minutes a month, no credit card.
Want us to explore your IVR?
TotalPath runs the same kind of test against your stack. Real audio. Real findings.