Conversational AI that doesn't fall over: lessons from CallVista

Real estate agents were losing high-intent buying signals during calls because they lacked instant access to client history, budget and preferences mid-conversation. Valuable follow-up context was either missed entirely or scattered across disconnected notes, weakening conversion rates on tours and deals. The information existed — it was just in the wrong place at the wrong time.

Closing the gap meant building intelligence into the live call itself, not the post-call recap. That’s CallVista. We’ve been running it in production for nine months. Here’s what we got right, what we got wrong, and what we’d do differently if we were starting today.

The brief

The brief was time-sensitive: prompts had to surface during a live call, not after. We scoped a Flask + WebRTC architecture so transcription, intent recognition and CRM lookups could all run inline with the audio stream, with the AI prompt engine reasoning on each new utterance as it landed.

Privacy and audit needed to be designed in from day one — every transcript flows to a secured vault keyed to the deal record. The agent never has to think about where the conversation went; the system has already filed it.

The transcription stack

WebRTC captures the audio. A thin Flask gateway tees the stream into two paths: one for archival, one for real-time inference. The inference path forwards 250 ms chunks to OpenAI’s Whisper for transcription, and the resulting text streams back through a normalisation layer that attaches speaker labels and timestamps.
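To make the tee concrete, here is a minimal sketch of the two paths. It is an illustration rather than CallVista's gateway code: it assumes 16 kHz mono 16-bit PCM, and the names `tee_stream`, `archive` and `infer` are hypothetical. Incoming packets go to the archive path untouched, while the inference path receives them regrouped into fixed 250 ms chunks.

```python
from typing import Callable, Iterable

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM
CHUNK_MS = 250         # chunk length quoted in the article


def chunk_size(sample_rate: int = SAMPLE_RATE, chunk_ms: int = CHUNK_MS) -> int:
    """Bytes in one chunk of mono 16-bit PCM of the given duration."""
    return sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000


def tee_stream(
    packets: Iterable[bytes],
    archive: Callable[[bytes], None],
    infer: Callable[[bytes], None],
    size: int,
) -> None:
    """Send every packet to the archive; send fixed-size chunks to inference."""
    buffer = bytearray()
    for packet in packets:
        archive(packet)              # archival path: raw packets, as received
        buffer.extend(packet)
        while len(buffer) >= size:   # inference path: regrouped into chunks
            infer(bytes(buffer[:size]))
            del buffer[:size]
    if buffer:
        infer(bytes(buffer))         # flush the trailing partial chunk
```

In a real gateway `archive` would write to storage and `infer` would hand the chunk to the transcription client; both are plain callables here so the regrouping logic can be tested on its own.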

Rolling-window intent recognition

Intent is classified against a 15-second sliding window — short enough to react in time, long enough to catch a complete thought. Anything shorter and the model ping-pongs on partial utterances. Longer and the prompt arrives after the moment has passed.
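A rolling window like this can be sketched with a deque that evicts anything older than 15 seconds each time a new utterance lands. The `Utterance` and `RollingWindow` names are hypothetical, and the classifier itself is out of scope; the sketch only shows how the text handed to it might be assembled.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Utterance:
    end_time: float  # seconds since the start of the call
    speaker: str
    text: str


class RollingWindow:
    """Keeps only the utterances from the last `span_seconds` of the call."""

    def __init__(self, span_seconds: float = 15.0) -> None:
        self.span = span_seconds
        self._items: deque = deque()

    def add(self, utterance: Utterance) -> None:
        self._items.append(utterance)
        cutoff = utterance.end_time - self.span
        # Drop utterances that ended before the window opened.
        while self._items and self._items[0].end_time < cutoff:
            self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)

    def text(self) -> str:
        """Speaker-labelled text of the window, oldest utterance first."""
        return " ".join(f"{u.speaker}: {u.text}" for u in self._items)
```

Each call to `add` would be followed by a classification pass over `text()`, so the classifier always sees roughly the last 15 seconds rather than a single fragment.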

The CRM loop

Every classified intent triggers a CRM lookup keyed on the active deal record. The agent’s UI shows the relevant context — last contact date, budget range, properties already toured — as a panel beside the live transcript. It’s not “ask the AI for context”; it’s “the AI puts context on the screen before you need it”. The distinction matters more than it sounds.
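As a rough sketch of that loop: a classified intent selects which fields of the active deal record go on the panel. Everything named here is hypothetical, including the intent labels, the `DealRecord` shape and the dictionary standing in for the CRM; a production version would call the CRM's own API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DealRecord:
    deal_id: str
    last_contact: str
    budget_range: str
    toured: Tuple[str, ...]


# Hypothetical mapping from intent label to the fields worth surfacing.
PANEL_FIELDS = {
    "budget_discussion": ("budget_range",),
    "tour_request": ("toured", "last_contact"),
}
DEFAULT_FIELDS = ("last_contact", "budget_range", "toured")


def build_panel(intent: str, deal_id: str, crm: Dict[str, DealRecord]) -> dict:
    """Return the context to display for this intent, or {} if no record."""
    record: Optional[DealRecord] = crm.get(deal_id)
    if record is None:
        return {}
    fields = PANEL_FIELDS.get(intent, DEFAULT_FIELDS)
    return {name: getattr(record, name) for name in fields}
```

The point of the shape is that the lookup is pushed by the classifier, not pulled by the agent: nothing in it waits for a question.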

What we’d do differently

Three things we’d change if we were rebuilding from scratch:

Hold-time handling. We didn’t predict how often agents put callers on hold. The intent classifier read the resulting silence as a thinking pause. We’ve since added a hold-detection path.
Two-speaker streams. We started with a single mixed audio stream. Splitting agent and caller into independent streams improved intent accuracy by around 30% in our internal benchmarks.
Suggestion fatigue. Too many prompts during a single call train agents to dismiss them. The current rate-limiting — one suggestion per 30 seconds with priority queuing — was our third attempt.
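The rate-limiting rule in the third item, one suggestion per 30 seconds with priority queuing, can be sketched with a heap and a timestamp. This is a simplified illustration, not the production limiter: the `SuggestionLimiter` name, the convention that a lower number means higher priority, and the explicit `now` argument are all assumptions.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class SuggestionLimiter:
    """Releases at most one queued suggestion per `min_interval` seconds."""

    def __init__(self, min_interval: float = 30.0) -> None:
        self.min_interval = min_interval
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()   # tie-breaker: first in, first out
        self._last_shown: Optional[float] = None

    def offer(self, priority: int, text: str) -> None:
        """Queue a suggestion; a lower number means higher priority."""
        heapq.heappush(self._heap, (priority, next(self._order), text))

    def poll(self, now: float) -> Optional[str]:
        """Return the best queued suggestion if the interval has elapsed."""
        if not self._heap:
            return None
        if self._last_shown is not None and now - self._last_shown < self.min_interval:
            return None
        _, _, text = heapq.heappop(self._heap)
        self._last_shown = now
        return text
```

A real limiter would probably also expire suggestions that have gone stale while waiting in the queue; that is left out here to keep the sketch short.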

CallVista has been live for nine months and is now the default workflow for the client’s sales team. The underlying architecture has held up to every load and feature change we’ve thrown at it.
