Today's signal
OpenAI released three new real-time audio models via its API on May 7: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is the first voice model to carry GPT-5-class reasoning, meaning it can handle complex requests, use tools mid-conversation, and maintain context across long exchanges rather than resetting after every reply.
Why it matters
Voice AI has been stuck in a call-and-response loop since its earliest products. Models could sound human, but could not reason well enough to handle anything beyond simple queries. GPT-Realtime-2 breaks that ceiling with a 128K context window (up from 32K), parallel tool calls, and tunable reasoning depth. GPT-Realtime-Translate now supports speech input in over 70 languages with output in 13, opening multilingual voice products that were previously too brittle to deploy. Enterprise customers, including Zillow, Deutsche Telekom, and Priceline, are already building on it.
The take
This is the release that makes voice agents a serious enterprise infrastructure story, not a demo. Every AI company has been racing to make text agents smarter. OpenAI just made the same bet on voice. If reasoning at the voice layer is now production-grade, the entire category of voice-heavy enterprise workflows (customer support, healthcare documentation, multilingual sales) gets repriced. The companies that build on this first will own territory that will be very expensive to take back.
The number
$64 per million audio output tokens. That is the price of GPT-Realtime-2 on the output side. For context, GPT-Realtime-Whisper, the transcription model, costs $0.017 per minute. The pricing gap between the reasoning model and the utility models tells you exactly how OpenAI is valuing intelligence at the voice layer.
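To make that pricing gap concrete, here is a minimal back-of-envelope cost sketch in Python for a single voice session. The two prices come from the figures above; AUDIO_TOKENS_PER_MINUTE is an assumed placeholder (actual audio token rates depend on codec and speech density and are not stated in this piece), so treat the output as illustrative, not a quote.

```python
# Rough cost sketch for a voice session, using the prices quoted above.
# AUDIO_TOKENS_PER_MINUTE is an assumption for illustration only.

OUTPUT_PRICE_PER_M_TOKENS = 64.00   # GPT-Realtime-2, $ per 1M audio output tokens
WHISPER_PRICE_PER_MINUTE = 0.017    # GPT-Realtime-Whisper, $ per minute

AUDIO_TOKENS_PER_MINUTE = 600       # assumed; replace with measured usage


def realtime_output_cost(minutes: float,
                         tokens_per_minute: float = AUDIO_TOKENS_PER_MINUTE) -> float:
    """Estimated output-side cost of a GPT-Realtime-2 session."""
    tokens = minutes * tokens_per_minute
    return tokens / 1_000_000 * OUTPUT_PRICE_PER_M_TOKENS


def whisper_cost(minutes: float) -> float:
    """Cost of transcribing the same audio with GPT-Realtime-Whisper."""
    return minutes * WHISPER_PRICE_PER_MINUTE


minutes = 10
print(f"Realtime-2 output:     ${realtime_output_cost(minutes):.3f}")
print(f"Whisper transcription: ${whisper_cost(minutes):.3f}")
```

Under the 600-tokens-per-minute assumption, ten minutes of generated audio costs roughly $0.38 on the reasoning model versus $0.17 to transcribe the same duration, and the gap scales linearly with session length.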
Read the full breakdown at analyticsdrift.com/openai-gpt-realtime-2-voice-api