You can talk to one of our bots directly at https://leapingai.com, and there’s a demo video at https://www.youtube.com/watch?v=xSajXYJmxW4.
Large companies are understandably reluctant to have AI start picking up their phone calls: the technology kind of works, but often not very well. If they do take the plunge, they often spend months tuning prompts for just one use case, and sometimes never release the voice bot at all.
The problem is two-sided: it's non-trivial to specify the exact way a bot should behave using plain language, and it's tedious to ensure the LLM always follows your instructions the way you intended them.
Existing voice AI solutions are a pain to set up for complex use cases. They require months of prompt engineering to cover all the edge cases before going live, and then months of monitoring and prompt tweaking afterwards. We do that better than human prompters, and much faster, by running a continuous analysis + testing loop.
Our tech is roughly divided into three subcomponents: a core library, a voice server, and self-improvement logic. The core library models and executes the multi-stage (think n8n-style) voice agents. For the voice server we use the ol' reliable cascading approach of STT -> LLM -> TTS. We tried out the voice-to-voice models, and although they felt really great to talk to, function-calling performance was, as expected, much worse, so we are still waiting for them to get better.
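To make the cascade concrete, here's a minimal sketch of one conversational turn; the three stage functions are stand-in stubs for illustration, not our actual API:

    def transcribe(audio: bytes) -> str:
        # Stand-in for a real STT call (e.g. a streaming recognizer).
        return "caller said something"

    def complete(history: list[dict]) -> str:
        # Stand-in for an LLM call; a real agent would also handle
        # function calls emitted by the model here.
        return "agent reply"

    def synthesize(text: str) -> bytes:
        # Stand-in for a TTS call that returns playable audio.
        return text.encode()

    def handle_turn(audio: bytes, history: list[dict]) -> bytes:
        user_text = transcribe(audio)                         # STT
        history.append({"role": "user", "content": user_text})
        reply = complete(history)                             # LLM
        history.append({"role": "assistant", "content": reply})
        return synthesize(reply)                              # TTS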
The self-improvement works by first taking conversation metrics and evaluation results and producing 'feedback', i.e. specific ideas for how the voice agent setup could be improved. After enough feedback is collected, we trigger a run of a specialized self-improvement agent: a Cursor-style AI with access to various tools that changes the main voice agent. It can rewrite prompts, configure a stage to use a summarized conversation instead of the full one, and more. Each iteration produces a new snapshot of the agent, enabling us to route a small part of the traffic to it and promote it to production if things look OK. This loop can be set to run without any human involvement, making agents self-improve.
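A rough sketch of that loop, with all names and thresholds invented for illustration:

    import random

    FEEDBACK_THRESHOLD = 20   # how much feedback triggers an improvement run
    CANARY_FRACTION = 0.05    # slice of traffic sent to the new snapshot

    def collect_feedback(metrics: dict, evals: dict) -> list[str]:
        # Turn metrics + eval results into concrete improvement ideas.
        ideas = []
        if metrics.get("transfer_rate", 0) > 0.3:
            ideas.append("clarify escalation rules in the routing stage prompt")
        if evals.get("booking_eval", 1.0) < 0.9:
            ideas.append("use a summarized conversation in the booking stage")
        return ideas

    def run_improvement_agent(snapshot: dict, feedback: list[str]) -> dict:
        # Stand-in for the Cursor-style agent that rewrites prompts etc.
        candidate = dict(snapshot)
        candidate["version"] = snapshot["version"] + 1
        candidate["changelog"] = list(feedback)
        return candidate

    def route_call(prod: dict, canary: dict | None) -> dict:
        # Send a small part of the traffic to the candidate snapshot;
        # it gets promoted to production if its metrics hold up.
        if canary is not None and random.random() < CANARY_FRACTION:
            return canary
        return prod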
Leaping is use-case agnostic, but we currently focus on inbound customer support (travel, retail, real estate, etc.) and lead pre-qualification (Medicare, home services, performance marketing) since we have a lot of success stories there.
We started out in Germany, since that's where we were in university, but growth was initially challenging. We decided to target enterprise customers right away, and they were reluctant to adopt voice AI as the front-door 'face' of their company. Additionally, for an enterprise with thousands of calls daily, it is infeasible to monitor all the calls and tune agents manually. To address their very valid concerns, we put all our effort into reliability, and still haven't gotten around to offering self-serve access, which is one reason we don't have fixed pricing yet. (Also, with some clients we have outcome-based pricing, i.e. you pay nothing for calls that didn't convert a lead, only for the ones that did.)
Things have picked up momentum since we got into YC and moved to the US, but the cautious sentiment is present here too if you try to sell to big enterprises. We believe that doing evals, simulation, and A/B testing really, really well is our competitive edge and what will enable us to solve large, sensitive use cases.
We’d love to hear your thoughts and feedback!
I wonder why! Most (or all) customer support calls are recorded. Have you tried (or proposed) training on that corpus on your customers' premises? You can do multiple evals in that setting: replay user calls into a corpus-trained AI agent vs. a generic AI agent and see the difference. Agents can be run on a 24/7 self-test, analysis, adjustment, and reporting loop. Continuously run that loop and compare your AI agent's responses against human operators'.
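Something like this, roughly (illustrative only; 'agent' and 'judge' are whatever callables you plug in):

    def replay(caller_turns: list[str], agent) -> list[str]:
        # Feed a recorded call's caller turns into an agent, turn by turn.
        history, replies = [], []
        for turn in caller_turns:
            history.append({"role": "user", "content": turn})
            reply = agent(history)
            history.append({"role": "assistant", "content": reply})
            replies.append(reply)
        return replies

    def compare(caller_turns, corpus_trained_agent, generic_agent, judge):
        # Replay the same call into both variants and score the diff,
        # e.g. with an LLM judge or a rubric.
        a = replay(caller_turns, corpus_trained_agent)
        b = replay(caller_turns, generic_agent)
        return judge(caller_turns, a, b)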
I built a skeleton of an iOS app that managed my calls such that I could choose to answer, decline, or send them to my chatbot.
So it gets real data from all my regular calls, and in my state (one-party consent) I don't need anyone's permission to record every call. That data kicks off a fine-tuning run that can happen overnight or locally to improve my personal model.
My plan was to use Whisper and a local model with my voice clone, and have it talk with everyone I didn't want to, eventually to the point where I never talk with any person I don't want to.
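Roughly this kind of pipeline (the Whisper calls are the real openai-whisper API; the local model and voice-clone TTS are placeholders for whatever you run locally):

    import whisper

    stt = whisper.load_model("base")

    def local_llm(prompt: str) -> str:
        # Placeholder for a locally hosted model fine-tuned overnight
        # on your own recorded calls.
        return "sure, let me check my calendar"

    def voice_clone_tts(text: str) -> bytes:
        # Placeholder for a TTS engine loaded with your cloned voice.
        return text.encode()

    def screen_call(audio_path: str) -> bytes:
        caller_text = stt.transcribe(audio_path)["text"]
        reply = local_llm(f"Caller said: {caller_text}\nRespond as me:")
        return voice_clone_tts(reply)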
I would pay you for a local way to do that. However, I'd NEVER give you that data, though I'm sure plenty of people would.
Kinda funny how many amazing CX companies start in Germany!
I’m the CEO & founder of Rime, so I’ve been following your progress with real interest. Feel free to reach out and I’d love to explore ways we might collaborate. Until then, wishing you tons of success on this big milestone!
I always feel these bots are way too "polished" in their responses or how they speak. Maybe that's a good thing and we're just so used to hearing people speak more casually and be less well-spoken, lol. It makes it feel inauthentic, but perhaps that will change over time.
The "Request a demo" button also does nothing other than change the text on success. Not sure if it even went through...
What sets us apart is multi-stage conversation modeling, out-of-the-box evals, and self-improvement!
In general, we currently have really high success rates with relatively constrained use cases, such as lead qualification and well-scoped customer service use cases (e.g., appointment booking, travel cancellation).
In general, voice AI is hard because it's WYSIWYG: there is no human in the loop between what the bot says and what the person on the other side hears. Not sure about legal, but for more complex use cases (e.g., product refunds in retail), there are many permutations in how two different customers might frame the same issue, so it might be harder to accurately instruct the AI agent in a way that guarantees high automation rates (given the plenitude of edge cases).
It is therefore our belief that voice AI works best when the bot leads the conversation and it is always very clear what the next steps are...
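For example, one way to model that (an illustrative structure, not our actual schema) is explicit stages with a fixed set of allowed transitions, so the bot can never wander outside the flow:

    STAGES = {
        "greeting": {"prompt": "Greet and ask for the booking ID.",
                     "next": ["lookup"]},
        "lookup":   {"prompt": "Confirm the booking details.",
                     "next": ["cancel", "transfer"]},
        "cancel":   {"prompt": "Cancel and confirm via email.",
                     "next": ["goodbye"]},
        "transfer": {"prompt": "Hand off to a human agent.",
                     "next": []},
        "goodbye":  {"prompt": "Wrap up politely.",
                     "next": []},
    }

    def allowed_next(stage: str) -> list[str]:
        # A deviation is any model-proposed jump not in this list.
        return STAGES[stage]["next"]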
Therefore I think the verticals of customer service and lead pre-qualification make a lot more sense. Since you guys have the numbers, I'm curious to learn more about how you define constraints for the bot and how often calls in these verticals deviate from those constraints.
I'm also curious about your opinions on, or whether you've seen, any successful use cases where the bot has to be a bit more "creative" to either string together information given to it or make reasonable extrapolations beyond the information it has.
P.S. Arkadiy is locked out of his HN account due to the anti-procrastination settings. HN team, can you plz help? :)
feedback (exaggerated):

1. change stage prompt
2. change function description
3. add extra instructions to the end of the context
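Hypothetically, a feedback item the self-improvement agent consumes might look like this (field names invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class Feedback:
        kind: str        # "stage_prompt" | "function_description" | "extra_instructions"
        target: str      # which stage or function to touch
        suggestion: str  # the concrete change to make

    examples = [
        Feedback("stage_prompt", "qualification",
                 "ask for the caller's zip code before quoting"),
        Feedback("function_description", "book_appointment",
                 "clarify that the date must be ISO-8601"),
        Feedback("extra_instructions", "context",
                 "append: never promise same-day service"),
    ]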
Metrics are easy to generalize (e.g., call transfer rate), but the baseline is different for each agent, so we interpret only the changes, not the absolute values (in the context of self-improvement).
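So promotion decisions compare a candidate snapshot against the same agent's own baseline, something like this (a sketch, names invented):

    def improved(baseline: dict, candidate: dict,
                 lower_is_better=("transfer_rate",)) -> bool:
        # Compare deltas against the agent's own baseline, not an
        # absolute target; any regression blocks promotion.
        for name, base in baseline.items():
            delta = candidate[name] - base
            if name in lower_is_better:
                delta = -delta
            if delta < 0:
                return False
        return True

    # e.g. improved({"transfer_rate": 0.30}, {"transfer_rate": 0.25}) -> True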