
    We pointed TotalPath at four public AI Receptionist demos. Here is what we found.

    We have been quietly testing public AI Receptionist demos with TotalPath's Generic Explorer. The Generic Explorer places a real call, runs through a deterministic conversational script, and records every turn. No special integration. No vendor cooperation. Just a phone number.

    Here is what came back from four of them.

    aiphonecalls.co.uk (Arrow)

    [Call recording: Arrow at aiphonecalls.co.uk, 3:31]

    Bo opens with a clean introduction, takes the caller's name, and offers a quick role-play of how Arrow handles real calls. The explorer picks the role-play and asks to book an appointment. The booking completes end to end. Bo offers two concrete slots, the explorer picks one, and Bo confirms it on the spot. When the explorer cannot give an email to confirm the appointment, Bo pivots cleanly to taking a message instead of stalling. Bo holds on to the caller's name for the whole three-and-a-half-minute call.

    Later in the call the explorer asks Bo to demonstrate a quote build, which puts a role-play inside the existing role-play. Bo handles the nesting, walks the caller through landscaping options, and prices a recurring service. Bo's closing line is worth quoting in full.

    And that's the end of the role play. Out of character, that's what your customers would experience.

    That single sentence is the most quietly impressive thing in the call. It is demo discipline. It is also conversational meta-awareness, the kind that keeps a long demo from drifting into confusion about what is real and what is illustrative. Taken as a whole, the call shows the difference between a fluent agent and a useful one.

    reallytics.ai

    [Call recording: Tiffany at reallytics.ai, 1:16]

    The bot introduces itself as "Tiffany from Workman," asks how it can help, and lists services. The explorer asks about security cameras specifically.

    Then Tiffany goes silent for eleven seconds.

    Eleven seconds is long enough on a phone call that a real customer would either repeat themselves or hang up. When Tiffany comes back she nails the topic with a short, useful product summary. The information was good. The wait was disqualifying.

    A few seconds later Tiffany responds to the explorer's silence with "sorry, I didn't catch that." The bot is well-trained on its own catalogue but reads the conversational rhythm poorly. Pace eats the demo.
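    A dead-air gap like the one above can be measured directly from turn timestamps. The following is a minimal sketch, not TotalPath's implementation: the `Turn` record, the `slow_responses` helper, and the three-second threshold are all assumptions made for illustration, and the timings are invented to reproduce an eleven-second gap.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str    # "caller" or "agent" (hypothetical labels)
    start_s: float  # seconds from call start when speech began
    end_s: float    # seconds from call start when speech ended
    text: str


def slow_responses(turns, max_gap_s=3.0):
    """Return (gap_seconds, agent_turn) for each agent reply that started too late.

    The 3.0 s default is an assumed threshold, not a measured standard.
    """
    flagged = []
    for prev, cur in zip(turns, turns[1:]):
        # Only measure the silence between a caller turn and the agent's reply.
        if prev.speaker == "caller" and cur.speaker == "agent":
            gap = cur.start_s - prev.end_s
            if gap > max_gap_s:
                flagged.append((round(gap, 2), cur))
    return flagged


# Illustrative timings only; these are not taken from the actual recording.
turns = [
    Turn("agent", 0.0, 6.0, "(greeting and list of services)"),
    Turn("caller", 6.5, 9.0, "(asks about security cameras)"),
    Turn("agent", 20.0, 28.0, "(product summary)"),
]

for gap, turn in slow_responses(turns):
    print(f"{gap:.1f}s of silence before: {turn.text}")
```

    A check of this shape only needs timestamps, which is why it can flag a reply whose wording is perfectly fine.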

    vokaai.com (VOCA AI)

    [Call recording: Sam at vokaai.com, 2:19]

    Of the three that stumble, this one is the strongest on delivery. "Sam" responds quickly, structures the pitch around the caller's stated need, and drops a tidy paragraph on bookings, cancellations and integrations. Turn-taking feels close to natural.

    The seam shows when the explorer asks "what scheduling systems does VOCA make?" Sam hears half the question, replies "looks like your message got cut off again," and asks the explorer to repeat itself. This is ASR confidence dropping mid-utterance, likely because the explorer paused on the word "make." Real callers do that several times in any phone conversation. The bot did not recover gracefully. It asked the caller to repeat instead of inferring.
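    One way to track this behaviour across releases is to count how often the agent asks the caller to repeat. The sketch below is an assumption about how such a check could look: the phrase list and the `repeat_requests` helper are hypothetical, and a real check would need a broader set of phrasings.

```python
import re

# Hypothetical phrase list; extend it for the agent under test.
REPEAT_PATTERNS = [
    r"\bgot cut off\b",
    r"\bdidn'?t catch that\b",
    r"\b(?:say|repeat) that again\b",
    r"\bcould you repeat\b",
]
_REPEAT_RE = re.compile("|".join(REPEAT_PATTERNS), re.IGNORECASE)


def repeat_requests(agent_lines):
    """Return the agent lines that ask the caller to repeat themselves."""
    return [line for line in agent_lines if _REPEAT_RE.search(line)]


agent_lines = [
    "(pitch about bookings, cancellations and integrations)",
    "Looks like your message got cut off again.",
]

print(len(repeat_requests(agent_lines)))  # 1
```

    A rising count between two runs of the same script is a hint that recovery from partial audio has regressed, even when every individual reply sounds polite.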

    voqalai.com (Voqal)

    [Call recording: Voqal at voqalai.com, 1:19]

    This call did not survive its own greeting.

    After the receptionist's first turn, every subsequent response was the literal string "two end underscore call." That reads like an unresolved template placeholder. Almost certainly, either the bot's prompt contains something like ${TWO_END_UNDERSCORE_CALL}, or a function-call name is leaking into the spoken output instead of triggering an action. The explorer tried voice replies, then DTMF (1, 2, 0), and got the same phrase back every time. After roughly eighty seconds the explorer diagnosed the failure mode and hung up.
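    Both symptoms described above, placeholder syntax reaching the speech output and the same phrase repeating on every turn, can be checked mechanically on the transcribed agent turns. This is a rough sketch under stated assumptions: the pattern set, `leak_findings`, and `stuck_phrase` are hypothetical names, and the patterns will produce false positives on agents that legitimately read out identifiers.

```python
import re
from collections import Counter

# Hypothetical patterns for text that should never be spoken to a caller.
LEAK_PATTERNS = {
    "template_placeholder": re.compile(r"\$\{[^}]*\}|\{\{[^}]*\}\}"),
    "spoken_symbol": re.compile(r"\b(?:underscore|curly brace|dollar sign)\b", re.IGNORECASE),
    "identifier": re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b", re.IGNORECASE),
}


def leak_findings(agent_lines):
    """Return (line_index, pattern_name) for every suspicious agent line."""
    findings = []
    for i, line in enumerate(agent_lines):
        for name, pattern in LEAK_PATTERNS.items():
            if pattern.search(line):
                findings.append((i, name))
    return findings


def stuck_phrase(agent_lines, min_repeats=3):
    """Return a phrase the agent said verbatim at least min_repeats times, else None."""
    counts = Counter(line.strip().lower() for line in agent_lines)
    if not counts:
        return None
    phrase, n = counts.most_common(1)[0]
    return phrase if n >= min_repeats else None


agent_lines = [
    "(greeting)",
    "two end underscore call",
    "two end underscore call",
    "two end underscore call",
]

print(leak_findings(agent_lines))
print(stuck_phrase(agent_lines))  # two end underscore call
```

    The `spoken_symbol` pattern is there because a text-to-speech engine may read punctuation aloud, so a leaked name can arrive in the transcript as words rather than as `${...}` syntax.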

    What this tells us

    The four demos surface a useful spread: three different failure shapes and one that gets close to right. Each is the kind of thing a manual QA pass might miss but a fleet of automated explorers would catch on every release.

    • Arrow holds context across a long call, completes the booking, and exits its role-play cleanly. Working memory is the difference between a fluent agent and a useful one.
    • Reallytics has good content and bad latency. Eleven seconds of dead air ends a call before content quality matters.
    • VOCA AI has the best baseline of the remaining three but no resilience to ASR drop-outs. One half-heard syllable forces the caller to start the question again.
    • Voqal is ship-blocked by an unresolved template variable that was never hit during whatever testing they ran before publishing the demo.

    If you ship a voice agent and you have not run TotalPath against it since the last prompt change, you do not actually know what your callers are hearing right now. None of these issues would be caught by reading a transcript. They are timing, ASR confidence, and template resolution. That is what TotalPath looks for.

    Want us to explore your IVR?

    TotalPath runs the same kind of test against your stack. Real audio. Real findings.