We pointed TotalPath at four public AI Receptionist demos. Here is what we found.

    By Phil Smith
    6 min read

    We have been quietly testing public AI Receptionist demos with TotalPath's Generic Explorer. The Generic Explorer places a real call, runs through a deterministic conversational script, and records every turn. No special integration. No vendor cooperation. Just a phone number.
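    For a sense of what "deterministic conversational script" means in practice, here is a minimal sketch of how such a script could be modelled as plain data, with each recorded reply checked for latency and expected topics. This is illustrative only, not TotalPath's actual format; every name and threshold here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One scripted explorer turn and what we record about the reply."""
    say: str                                            # what the explorer speaks
    expect_topics: list = field(default_factory=list)   # keywords we hope to hear back
    max_silence_s: float = 5.0                          # flag the reply if dead air exceeds this

# A deterministic script: the same utterances, in the same order, on every call.
DENTAL_BOOKING_SCRIPT = [
    Turn("Hi, I'd like to book a check-up.", ["name", "date"]),
    Turn("It's a dental practice.", ["check-up", "appointment"]),
    Turn("My name is Alex Park, next Tuesday morning.", ["Tuesday", "confirm"]),
]

def evaluate_reply(turn: Turn, reply_text: str, silence_s: float) -> list:
    """Return a list of findings for one recorded reply."""
    findings = []
    if silence_s > turn.max_silence_s:
        findings.append(f"latency: {silence_s:.1f}s of dead air")
    lowered = reply_text.lower()
    missing = [t for t in turn.expect_topics if t.lower() not in lowered]
    if missing:
        findings.append(f"missing topics: {missing}")
    return findings
```

    Because the script is data, every release gets the exact same call, which is what makes regressions visible.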

    Here is what came back from four of them.

    aiphonecalls.co.uk (Arrow)

    [Audio: Arrow at aiphonecalls.co.uk, 2:01]

    Arrow opens well. Crisp introduction, asks for the caller's name, offers a thirty-second role-play. The explorer takes the role-play, suggests a dental practice, asks to book a check-up.

    That is where it falls apart. Mid-booking, the assistant asks "what type of business is that?" despite having just been told. After the explorer gives a name and a preferred date, there is an eleven-second pause. Then the bot abandons the booking and loops back to its opening line, asking again for a full name and preferred date for the check-up. The role-play has reset without the caller noticing. Fluent language. No memory.
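    Loop-backs like this are easy to catch automatically: compare each bot turn against every earlier one and flag near-duplicates. A rough sketch of the idea (the 0.85 similarity threshold is an arbitrary assumption, and a real detector would be more involved):

```python
import difflib

def detect_loop(bot_turns, threshold=0.85):
    """Flag the first bot turn that is a near-repeat of an earlier one.

    Returns (earlier_index, later_index, similarity) or None.
    """
    for i, later in enumerate(bot_turns):
        for j in range(i):
            ratio = difflib.SequenceMatcher(
                None, bot_turns[j].lower(), later.lower()
            ).ratio()
            if ratio >= threshold:
                return (j, i, ratio)
    return None
```

    Run over the Arrow transcript, a check like this would flag the reset the moment the opening line reappears mid-booking.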

    reallytics.ai

    [Audio: Tiffany at reallytics.ai, 1:16]

    The bot introduces itself as "Tiffany from Workman," asks how it can help, and lists services. The explorer asks about security cameras specifically.

    Then Tiffany goes silent for eleven seconds.

    Eleven seconds is long enough on a phone call that a real customer would either repeat themselves or hang up. When Tiffany comes back she nails the topic with a short, useful product summary. The information was good. The wait was disqualifying.

    A few seconds later Tiffany says "sorry, I didn't catch that" to the explorer's silence. The bot is well-trained on its own catalogue but reads the conversational rhythm poorly. Pace eats the demo.
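    Dead air of that length is straightforward to flag once turns carry timestamps. A minimal sketch, assuming turns arrive as (start, end, speaker) tuples in seconds; the data shape and the four-second threshold are illustrative assumptions:

```python
def dead_air(turns, max_gap_s=4.0):
    """Return gaps where the receptionist left the caller waiting.

    turns: list of (start_s, end_s, speaker) tuples in call order.
    Each finding is (gap_start_s, gap_end_s, gap_length_s).
    """
    gaps = []
    for prev, cur in zip(turns, turns[1:]):
        gap = cur[0] - prev[1]
        if prev[2] == "explorer" and cur[2] == "receptionist" and gap > max_gap_s:
            gaps.append((prev[1], cur[0], gap))
    return gaps
```

    On the Tiffany call, the eleven-second hole between the camera question and the product summary is exactly what this would surface.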

    vokaai.com (VOCA AI)

    [Audio: Sam at vokaai.com, 2:19]

    This one is the strongest of the four. "Sam" responds quickly, structures the pitch around the caller's stated need, and drops a tidy paragraph on bookings, cancellations and integrations. Turn-taking feels close to natural.

    The seam shows when the explorer asks "what scheduling systems does VOCA make?" Sam hears half the question, replies "looks like your message got cut off again," and asks the explorer to repeat itself. This is ASR confidence dropping mid-utterance, likely because the explorer paused on the word "make." Real callers do that several times in any phone conversation. The bot did not recover gracefully. It asked the caller to repeat instead of inferring.
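    The graceful alternative is to infer: if a partial transcript still matches exactly one intent, answer it, and only re-ask when it is genuinely ambiguous. A toy illustration of that policy (the intent table and keyword matching are invented for the example; a production system would also weigh ASR confidence scores, not just keywords):

```python
def respond_to_partial(partial_text, intents):
    """Decide what to do with a half-heard utterance.

    intents: dict mapping intent name -> list of trigger keywords.
    Returns ("answer", intent_name) when exactly one intent matches,
    else ("reask", None) -- the only case where repeating is justified.
    """
    text = partial_text.lower()
    hits = [name for name, keywords in intents.items()
            if any(kw in text for kw in keywords)]
    if len(hits) == 1:
        return ("answer", hits[0])
    return ("reask", None)
```

    "What scheduling systems does VOCA..." already pins the topic; under this policy the bot would answer instead of asking the caller to start over.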

    voqalai.com (Voqal)

    [Audio: Voqal at voqalai.com, 1:19]

    This call did not survive its own greeting.

    After the receptionist's first turn, every subsequent response was the literal string "two end underscore call." That is an unresolved template placeholder. The bot's prompt almost certainly contains something like ${TWO_END_UNDERSCORE_CALL} or a function-call name leaking into the spoken output instead of triggering an action. The explorer tried voice replies, then DTMF (1, 2, 0), and got the same phrase back every time. After roughly eighty seconds the explorer diagnosed the failure mode and hung up.
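    A leak like this is exactly the kind of thing a release-gate check can catch before a caller ever hears it. A minimal sketch, assuming the spoken output is available as text; the patterns are illustrative, not exhaustive:

```python
import re

# Things that should never reach a caller's ears: unexpanded ${...} or
# {{...}} placeholders, snake_case identifiers read aloud by the TTS,
# and the TTS spelling out the underscore character itself.
LEAK_PATTERNS = [
    re.compile(r"\$\{[^}]+\}"),           # ${PLACEHOLDER}
    re.compile(r"\{\{[^}]+\}\}"),         # {{placeholder}}
    re.compile(r"\b\w+_\w+\b"),           # leaked identifiers like end_call
    re.compile(r"\bunderscore\b", re.I),  # TTS spelling the character out
]

def leaked_template(utterance: str) -> bool:
    """True if an utterance looks like prompt plumbing, not speech."""
    return any(p.search(utterance) for p in LEAK_PATTERNS)
```

    "Two end underscore call" trips the last pattern immediately; one assertion in a pre-publish test run and this demo never ships broken.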

    What this tells us

    The four demos surface four different failure shapes. Each is the kind of thing a manual QA pass might miss but a fleet of automated explorers catches every release.

    • Arrow loses conversational context under task complexity (a role-play inside a role-play). Its language is fluent. Its memory is not.
    • Reallytics has good content and bad latency. Eleven seconds of dead air ends a call before content quality matters.
    • VOCA AI has the best baseline but no resilience to ASR drop-outs. One half-heard syllable forces the caller to start the question again.
    • Voqal is ship-blocked by an unresolved template variable that was never hit during whatever testing they ran before publishing the demo.

    If you ship a voice agent and you have not run TotalPath against it since the last prompt change, you do not actually know what your callers are hearing right now. None of these issues would be caught by reading a transcript. They are timing, ASR confidence, and template resolution. That is what TotalPath looks for.

    Want us to explore your IVR?

    TotalPath runs the same kind of test against your stack. Real audio. Real findings.