
How We Built Our Voice System

A deep dive into the engineering behind real-time voice synthesis — from latency optimization to emotional tone modeling.

Marcus Liang · Voice Infrastructure Lead · 12 min read

When we set out to build voice calls into Lovimuse, the brief was uncomfortably simple: it has to feel like a phone call. Not a chatbot reading a paragraph. Not a smart speaker waiting its turn. A real conversation where you can interrupt, where pauses mean something, and where the voice on the other end carries actual emotion.

This post walks through how we got there — what we tried, what we threw out, and what we are still working on.

Latency is the whole problem

Conversational latency tolerance is roughly 300 milliseconds end-to-end. Past that, people start talking over the model and the magic dies. Our pipeline has four hops: speech-to-text, language model, text-to-speech, and network. Every one of them wanted to spend 400ms.

We rebuilt the pipeline around streaming. The STT model emits partial transcripts the moment a phrase is recognizable. The LLM streams tokens as they generate. The TTS model accepts those tokens incrementally and starts producing audio before the sentence is finished. By the time the model has decided how to end its thought, the first half is already in your ear.
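To make the shape of that concrete, here is a toy sketch in Python. The three stages are stand-ins, not our production models, and the phrase-boundary and chunking rules are deliberately crude; the point is that each stage consumes its input incrementally and yields output before the previous stage has finished.

    import asyncio

    # Toy stand-ins for the three models: illustrative only.
    async def stt_partials(frames):
        # Emit a partial transcript as soon as a phrase is recognizable.
        phrase = []
        async for frame in frames:
            phrase.append(frame)
            if frame.endswith(" "):          # crude phrase boundary
                yield "".join(phrase)
                phrase.clear()

    async def llm_tokens(partials):
        # Stream response tokens without waiting for the full transcript.
        async for text in partials:
            for token in ("I ", "heard: ", text):
                yield token
                await asyncio.sleep(0)       # hand control back between tokens

    async def tts_chunks(tokens):
        # Accept tokens incrementally; emit audio before the sentence ends.
        buf = []
        async for tok in tokens:
            buf.append(tok)
            if len(buf) >= 2:                # crude chunking threshold
                yield f"<audio: {''.join(buf)!r}>"
                buf.clear()
        if buf:
            yield f"<audio: {''.join(buf)!r}>"

    async def main():
        async def mic():                     # fake microphone frames
            for f in ("hel", "lo ", "the", "re "):
                yield f

        async for audio in tts_chunks(llm_tokens(stt_partials(mic()))):
            print(audio)                     # first chunk arrives long before EOS

    asyncio.run(main())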

Modeling emotional tone

A neutral voice reading a heartfelt line sounds worse than a written message. We trained a tone classifier that runs alongside the LLM and labels each generated chunk with intent: warm, playful, concerned, flirtatious, deadpan. That label conditions the TTS model so the output prosody matches the meaning.

The hardest part was avoiding parody. Early versions over-acted everything. We dialed it back by training on natural conversation, not voice acting, and by introducing a baseline of neutrality that emotion modulates rather than replaces.
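In code, the conditioning looks roughly like the sketch below. Everything here is a stand-in: the keyword classifier is a toy (the real one scores each chunk in conversational context), and the intensity parameter is shorthand for "modulate the neutral baseline rather than replace it."

    from dataclasses import dataclass

    TONES = ("warm", "playful", "concerned", "flirtatious", "deadpan", "neutral")

    @dataclass
    class LabeledChunk:
        text: str
        tone: str          # one of TONES, predicted per generated chunk
        intensity: float   # how far to pull prosody away from neutral

    def label_chunk(text: str) -> LabeledChunk:
        # Toy stand-in for the classifier that runs alongside the LLM.
        tone = "warm" if "miss you" in text else "neutral"
        return LabeledChunk(text, tone, intensity=0.3)

    def synthesize(chunk: LabeledChunk) -> None:
        # Stand-in for the TTS call: the tone label conditions prosody,
        # and a low intensity keeps emotion a nudge, not a performance.
        print(f"TTS({chunk.text!r}, tone={chunk.tone}, intensity={chunk.intensity})")

    for text in ("I was hoping you'd call.", "I miss you already."):
        synthesize(label_chunk(text))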

Interruptions and turn-taking

Half-duplex turn-taking — where one side waits for the other to finish — feels robotic in seconds. We implemented full-duplex audio with voice activity detection on both ends. When the user starts speaking, the model softly fades out, marks where it stopped, and decides whether to continue or pivot when its turn comes back.

A subtle detail: humans do not stop talking mid-word when interrupted. They finish the syllable. We added the same behavior to the model and the difference is immediately obvious.
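Sketched in code, the barge-in path looks something like this. The syllable-boundary lookup and the backchannel check are both stand-ins: in practice the boundary comes from phoneme timings in the TTS output, and continue-versus-pivot is a model decision, not a word list.

    from dataclasses import dataclass

    BACKCHANNELS = {"mm-hm", "yeah", "right", "uh-huh"}

    @dataclass
    class Playback:
        position_ms: int = 0
        resume_point_ms: int = 0

        def next_syllable_boundary(self) -> int:
            # Stand-in: the real version reads phoneme timings from the TTS output.
            return self.position_ms + 120

    def on_user_speech_started(pb: Playback) -> None:
        # User barged in: finish the syllable, fade out, mark the spot.
        stop_at = pb.next_syllable_boundary()    # never cut mid-word
        print(f"fading out over {stop_at - pb.position_ms} ms")
        pb.resume_point_ms = stop_at             # remember where we stopped

    def on_model_turn(pb: Playback, user_utterance: str) -> str:
        # Turn comes back: continue the interrupted thought, or pivot.
        if user_utterance.strip().lower() in BACKCHANNELS:
            return f"resume playback at {pb.resume_point_ms} ms"
        return f"pivot: respond to {user_utterance!r}"

    pb = Playback(position_ms=2400)
    on_user_speech_started(pb)
    print(on_model_turn(pb, "mm-hm"))
    print(on_model_turn(pb, "wait, what did you mean?"))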

What we still want to fix

Background noise robustness is harder than it looks once you support 100+ device types. Long silences also remain difficult — the model sometimes treats a thoughtful pause as the end of a turn. We are training a dedicated silence-meaning classifier to fix that.
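The shape of the fix is an endpointing rule that weighs silence duration against how semantically complete the turn sounds, instead of a fixed timeout. A toy version, with made-up thresholds:

    def end_of_turn(silence_ms: int, completeness: float) -> bool:
        # Toy endpointing rule with made-up thresholds. A fixed timeout
        # treats every pause the same; weighting by semantic completeness
        # gives a thoughtful pause a longer grace period.
        if silence_ms < 200:
            return False                  # still mid-phrase
        budget_ms = 400 + (1.0 - completeness) * 2000
        return silence_ms > budget_ms

    print(end_of_turn(800, completeness=0.9))  # True: the thought sounds finished
    print(end_of_turn(800, completeness=0.2))  # False: thinking, keep waiting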

Voice is the modality where small failures feel large. We will keep iterating, and we will keep publishing what we learn.

Tags

voice AI · TTS · real-time audio · engineering
