We built an Arabic-English voice AI. Here's what we learned.
Bilingual voice agents are the hardest thing we've shipped this year. Here are the five lessons that matter — on latency, code-switching, and calling Khaltu Fatima.
We shipped a bilingual Arabic-English voice receptionist for a halal coffee wholesaler last quarter. It now handles 94% of their inbound wholesale calls without a human touching them. Here are the five things we wish somebody had told us in week one.
1. Latency is everything
Below 600ms round-trip, the agent feels alive. Above 1s, it feels dead. Above 1.5s, the caller hangs up.
We ended up running ASR, the LLM, and TTS on three different providers in three different regions to keep the round-trip under 650ms. It's expensive. It's the only thing that matters.
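A budget like this is easier to hold when every turn is timed stage by stage. Below is a minimal sketch of that idea, not a description of the production setup: `TurnTimer` and the stage names are hypothetical, and only the 650ms figure comes from the text above.

```python
import time
from contextlib import contextmanager

BUDGET_MS = 650  # round-trip target quoted in the text


class TurnTimer:
    """Collects wall-clock timings for each stage of one conversational turn."""

    def __init__(self, budget_ms=BUDGET_MS):
        self.budget_ms = budget_ms
        self.stages = {}  # stage name -> elapsed milliseconds

    @contextmanager
    def stage(self, name):
        # Time whatever runs inside the `with` block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000.0

    def total_ms(self):
        return sum(self.stages.values())

    def over_budget(self):
        return self.total_ms() > self.budget_ms

    def slowest(self):
        # Name of the stage that consumed the most time this turn.
        return max(self.stages, key=self.stages.get)


if __name__ == "__main__":
    timer = TurnTimer()
    for name in ("asr", "llm", "tts"):  # hypothetical stage names
        with timer.stage(name):
            time.sleep(0.01)  # stand-in for the real provider call
    print(timer.stages, "over budget:", timer.over_budget())
```

Logging `slowest()` on every over-budget turn shows which provider or region to look at first.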
2. Code-switching is the baseline, not the bonus
Real bilingual Arab callers code-switch inside a single sentence. "Ana ab'at email wa'aetaqid inno wasal, but I'm not sure." (The Arabic half is roughly "I sent an email and I think it arrived.")
A voice model that can't code-switch mid-sentence is monolingual with extra steps. We use GPT-4o with routed Claude Haiku fallbacks, because each handles different dialect clusters better.
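One way to picture routed fallbacks is a preference table keyed by dialect, with the next model tried whenever the preferred one errors out or returns nothing. This is a sketch under loud assumptions: the dialect label is taken as already known, `model_a` and `model_b` are placeholders rather than a claim about which model suits which dialect, and `route_reply` is a hypothetical helper.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A responder takes the caller's utterance and returns a reply, or None if
# it has nothing usable. In practice it would wrap a model API call.
Responder = Callable[[str], Optional[str]]

# Hypothetical routing table: dialect label -> model names in preference order.
ROUTES: Dict[str, List[str]] = {
    "khaleeji": ["model_a", "model_b"],
    "maghrebi": ["model_b", "model_a"],
}
DEFAULT_ROUTE: List[str] = ["model_a", "model_b"]


def route_reply(
    utterance: str,
    dialect: str,
    responders: Dict[str, Responder],
    routes: Dict[str, List[str]] = ROUTES,
) -> Tuple[Optional[str], Optional[str]]:
    """Try models in this dialect's preferred order; fall back on failure."""
    for name in routes.get(dialect, DEFAULT_ROUTE):
        responder = responders.get(name)
        if responder is None:
            continue  # model not configured, try the next one
        try:
            reply = responder(utterance)
        except Exception:
            reply = None  # treat a provider error like an empty answer
        if reply:
            return name, reply
    return None, None  # nothing usable: the caller should hand off
```

Returning the model name along with the reply makes it possible to log which route actually answered each call.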
3. Nobody says numbers the same way
Phone numbers, dates, order IDs: in Arabic these are spoken in an order that doesn't map onto English. Twenty-five, for example, is said as "five and twenty." The TTS has to know how to say a number in the right dialect, not just what the number is.
- 555 in Khaleeji ≠ 555 in Maghrebi
- Dates in Hijri vs. Gregorian require explicit disambiguation
- Order IDs need to be read digit-by-digit or the caller mishears
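The digit-by-digit rule is the easiest of the three to sketch. The helper below turns an ID into text a TTS engine can read one symbol at a time. It is an illustration only: `spell_out_id` is a hypothetical name, and the Arabic table uses Modern Standard Arabic digit words as an assumption, so a real deployment would swap in dialect-specific tables.

```python
import unicodedata

# Assumption: Modern Standard Arabic digit words, not tuned to any dialect.
MSA_DIGITS = {
    "0": "صفر", "1": "واحد", "2": "اثنان", "3": "ثلاثة", "4": "أربعة",
    "5": "خمسة", "6": "ستة", "7": "سبعة", "8": "ثمانية", "9": "تسعة",
}
EN_DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out_id(order_id, digit_words, separator=", "):
    """Return the ID as separated symbols so TTS reads it one at a time."""
    parts = []
    for ch in order_id:
        if ch.isdecimal():
            # Normalise Arabic-Indic digits (e.g. "٥") to ASCII before lookup.
            parts.append(digit_words[str(unicodedata.decimal(ch))])
        elif ch.isalpha():
            parts.append(ch.upper())
        # Dashes, spaces, and other punctuation are dropped.
    return separator.join(parts)


if __name__ == "__main__":
    print(spell_out_id("QX-5051", EN_DIGITS))  # Q, X, five, zero, five, one
    print(spell_out_id("555", MSA_DIGITS))
```

The separator matters as much as the words: most TTS engines pause on commas, which gives the caller time to write each digit down.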
4. Khaltu Fatima will call your voicemail and stress-test you
The best QA pass we ever did was asking the client's khaltu (aunt) to call the line for 20 minutes. She found three bugs no engineer would have caught — including one where the agent would pause for three seconds if she coughed mid-sentence.
Build for khaltu, and you build for everyone.
5. A confident handoff beats a clever answer
The single biggest lift in customer satisfaction came when we reduced what the agent tried to answer. If confidence drops below 0.8, it says "Let me have Ahmed call you back by Maghrib." And then it actually does.
Confident handoff > clever half-answer. Always.
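The gate itself can be very small. Here is a minimal sketch, assuming the model's draft answer arrives with a single confidence score between 0 and 1. Only the 0.8 threshold and the handoff line come from the text above; `Draft` and `choose_reply` are hypothetical names.

```python
from dataclasses import dataclass

HANDOFF_THRESHOLD = 0.8  # threshold quoted in the text


@dataclass
class Draft:
    text: str          # the answer the agent would like to give
    confidence: float  # assumed to be a score in [0, 1]


def choose_reply(draft, handoff_line, threshold=HANDOFF_THRESHOLD):
    """Speak the draft only when confidence meets the threshold."""
    if draft.confidence < threshold:
        # Below the bar: promise a callback instead of guessing.
        return {"say": handoff_line, "handoff": True}
    return {"say": draft.text, "handoff": False}


if __name__ == "__main__":
    line = "Let me have Ahmed call you back by Maghrib."
    print(choose_reply(Draft("Your order ships Tuesday.", 0.93), line))
    print(choose_reply(Draft("I think it might be Tuesday?", 0.55), line))
```

The `handoff` flag is the part that makes the promise real: whatever consumes this result should create the callback task in the same step as speaking the line.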
Where this goes next
Voice is three years away from being the default. We're building for that world now.