Voice Agent
Definition
An AI agent that communicates via spoken language: it transcribes speech input with speech-to-text (STT), reasons about the request, takes actions, and responds with synthesized speech from text-to-speech (TTS). Voice agents operate in real-time conversational contexts where latency is critical (a response delay of more than about 500 ms feels unnatural). They combine language understanding with audio processing, prosody interpretation (tone, emphasis, pauses), and turn-taking management. Voice agents are deployed in customer service, healthcare, education, and personal assistant applications.
Builder Context
Voice agent engineering is fundamentally a latency engineering problem. The pipeline (STT → reasoning → tool use → TTS) must complete in under 1 second for conversation to feel natural. Optimize each stage: use streaming STT (process audio as it arrives instead of waiting for silence), keep reasoning prompts short and focused, parallelize tool calls where possible, and start TTS as soon as the first sentence is generated. The most common failure is building a text agent and wrapping it in voice instead of designing voice-first, which means shorter responses, confirmation patterns, and interruption handling.
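Two of the optimizations above, starting TTS on the first complete sentence and running independent tool calls in parallel, can be sketched with Python's asyncio. This is a minimal illustration, not a reference implementation: the names `sentences`, `respond`, `run_tools`, and the `speak` callback are hypothetical, no real STT, LLM, or TTS library is used, and the sentence splitter is deliberately naive.

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable, List

# Naive boundary: sentence punctuation followed by whitespace.
# Assumption: good enough for a sketch; it will missplit abbreviations like "Dr. Smith".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed text tokens into sentences, yielding each one as soon as it ends."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last is still growing.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream closes


async def respond(
    tokens: AsyncIterator[str],
    speak: Callable[[str], Awaitable[None]],
) -> None:
    """Hand each sentence to TTS (the hypothetical `speak`) without waiting for the full reply."""
    async for sentence in sentences(tokens):
        await speak(sentence)


async def run_tools(calls: List[Callable[[], Awaitable[str]]]) -> List[str]:
    """Run independent tool calls concurrently; results come back in call order."""
    return await asyncio.gather(*(call() for call in calls))


if __name__ == "__main__":
    async def fake_llm() -> AsyncIterator[str]:
        # Stand-in for a streaming model response.
        for token in ["Your order ", "shipped. ", "It arrives ", "Friday."]:
            await asyncio.sleep(0.05)
            yield token

    async def fake_tts(sentence: str) -> None:
        # Stand-in for a streaming TTS call.
        print("speaking:", sentence)

    asyncio.run(respond(fake_llm(), fake_tts))
```

In this sketch the first sentence reaches `speak` while the model is still generating the second, so perceived latency depends on the time to the first sentence, not on the length of the whole reply. `run_tools` only helps when the calls do not depend on each other's results; dependent calls still have to run in sequence.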