Agent Type

Voice Agent

Definition

An AI agent that communicates via spoken language: it transcribes speech input (speech-to-text, STT), reasons about the request, takes actions, and responds with synthesized speech (text-to-speech, TTS). Voice agents operate in real-time conversational contexts where latency is critical (response delays above roughly 500 ms feel unnatural). They combine language understanding with audio processing, prosody interpretation (tone, emphasis, pauses), and turn-taking management. Voice agents are deployed in customer service, healthcare, education, and personal assistant applications.

Builder Context

Voice agent engineering is fundamentally a latency engineering problem. The full pipeline (STT → reasoning → tool use → TTS) must complete in under 1 second for conversation to feel natural. Optimize each stage: use streaming STT (process audio as it arrives instead of waiting for silence), keep reasoning prompts short and focused, parallelize tool calls where possible, and start TTS as soon as the first sentence is generated. The most common failure is building a text agent and wrapping it in voice instead of designing voice-first: shorter responses, confirmation patterns, and interruption handling.
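Two of these optimizations, starting TTS on the first complete sentence and running independent tool calls in parallel, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `sentences_from_tokens`, `run_tools_in_parallel`, and `fake_llm_tokens` are hypothetical names, the token stream is simulated, and the sentence splitter is deliberately naive (it would split incorrectly on abbreviations such as "Dr.").

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences_from_tokens(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield each sentence as soon as it is complete in the token stream."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last may be partial.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


async def run_tools_in_parallel(
    calls: list[Callable[[], Awaitable[str]]],
) -> list[str]:
    """Run independent tool calls concurrently; results keep the input order."""
    return list(await asyncio.gather(*(call() for call in calls)))


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for token in ["Your order ", "shipped. ", "It arrives ", "Friday."]:
        await asyncio.sleep(0)  # simulate tokens arriving over time
        yield token


async def demo() -> list[str]:
    spoken = []
    async for sentence in sentences_from_tokens(fake_llm_tokens()):
        # A real agent would hand each sentence to its TTS engine here,
        # while the model is still generating the rest of the reply.
        spoken.append(sentence)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design point is that the first sentence reaches the speech stage before the model has finished the reply, so perceived latency depends on time to first sentence rather than time to full response. With `run_tools_in_parallel`, the tool stage costs roughly as much as the slowest call rather than the sum of all calls, provided the calls do not depend on each other.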