$\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Soham Ray, Keshav Dhandhania, Victor Barres, Karthik Narasimhan

Abstract

Full-duplex voice agents, systems that listen and speak simultaneously, are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $\tau$-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $\tau^2$-bench into a voice agent benchmark that combines verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio, enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; because simulation is decoupled from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks. While GPT-5 (reasoning) achieves 85% in text, voice agents reach only 31--51% under clean conditions and 26--38% under realistic conditions with noise and diverse accents, retaining only 30--45% of text capability. Qualitative analysis finds that 79--90% of failures stem from agent behavior under our evaluation setup. $\tau$-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.
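The headline numbers above relate to one another by simple arithmetic, which the following minimal sketch works through. It uses only the rounded percentages quoted in the abstract; the function and constant names are illustrative and do not come from the benchmark's code.

```python
# Rounded pass@1 figures quoted in the abstract, in percent.
TEXT_PASS_AT_1 = 85           # GPT-5 (reasoning), text
CLEAN_RANGE = (31, 51)        # voice agents, clean conditions
REALISTIC_RANGE = (26, 38)    # voice agents, realistic conditions


def drop_pp(voice: float, text: float = TEXT_PASS_AT_1) -> float:
    """Absolute gap to the text score, in percentage points."""
    return text - voice


def retention_pct(voice: float, text: float = TEXT_PASS_AT_1) -> float:
    """Share of the text score that the voice agent retains, in percent."""
    return 100.0 * voice / text


# Gap under clean conditions: 85 - 31 = 54pp and 85 - 51 = 34pp.
clean_drops = [drop_pp(v) for v in CLEAN_RANGE]

# Retention under realistic conditions: 26/85 and 38/85.
realistic_retention = [retention_pct(v) for v in REALISTIC_RANGE]

print(clean_drops)
print([f"{r:.1f}%" for r in realistic_retention])
```

From these rounded inputs the retention comes out at roughly 30.6% and 44.7%, which is consistent with the quoted 30--45% range; the exact endpoints presumably depend on the unrounded scores, which are not given here.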

Paper Structure (95 sections, 2 equations, 4 figures, 22 tables)

Figures (4)

  • Figure 1: Task completion (pass@1) averaged across all domains. GPT-5 (reasoning) achieves 85%. Voice agents show two levels of degradation: under Clean conditions (clean audio, no interruptions), performance drops to 31--51% ($-$34 to $-$54pp); under Realistic conditions (realistic audio, interruptions), it falls further to 26--38% (retaining only 30--45% of text capability).
  • Figure 2: $\tau$-Voice extends $\tau^2$-bench (gray) with voice-specific components (green): a voice user simulator with configurable personas, audio environment, and turn-taking policy; a full-duplex audio streaming channel discretized into simulation ticks; and a provider adapter for adding new voice APIs. Task infrastructure (instructions, tools, databases, domain policies) is inherited.
  • Figure 3: Voice user simulator pipeline. Each tick, the simulator generates text, synthesizes speech with a persona, mixes in environmental audio, and applies telephony degradation to produce realistic caller audio.
  • Figure 4: Speech activity timeline from a Retail domain simulation with Gemini Live. A customer calls about exchanging a jigsaw puzzle and correcting their address. The legend distinguishes observations (User Int. = user interruption, Non-Agent Dir. = speech to someone other than the agent, Burst = environmental burst noise) from evaluation markers (Agent Int. = agent interruption, BC Issue = incorrect backchannel handling, Voc. Tic Error / Non-Agent Dir. Error = agent incorrectly yielding or responding to these stimuli).
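The per-tick pipeline described in the Figure 3 caption (generate text, synthesize speech with a persona, mix in environmental audio, apply telephony degradation) can be sketched as a loop over discrete simulation ticks. This is only an illustration under stated assumptions: the tick length, sample rate, `Persona` fields, and every function body below are hypothetical placeholders standing in for the LLM, speech synthesis, and audio processing, not the benchmark's actual implementation.

```python
import random
from dataclasses import dataclass

SAMPLE_RATE = 8000   # assumed sample rate in Hz; not taken from the paper
TICK_MS = 200        # assumed tick length in ms; not taken from the paper
SAMPLES_PER_TICK = SAMPLE_RATE * TICK_MS // 1000


@dataclass
class Persona:
    """Hypothetical persona; `gain` stands in for real voice characteristics."""
    name: str
    gain: float


def generate_text(tick: int) -> str:
    # Placeholder for the LLM-driven user simulator.
    return f"utterance {tick}"


def synthesize(text: str, persona: Persona) -> list[float]:
    # Placeholder for speech synthesis: a deterministic pseudo-waveform seeded by the text.
    rng = random.Random(text)
    return [persona.gain * rng.uniform(-1.0, 1.0) for _ in range(SAMPLES_PER_TICK)]


def mix_environment(speech: list[float], noise_level: float, rng: random.Random) -> list[float]:
    # Placeholder for the audio environment: additive uniform noise.
    return [s + noise_level * rng.uniform(-1.0, 1.0) for s in speech]


def telephony_degrade(audio: list[float], levels: int = 256) -> list[float]:
    # Placeholder for telephony degradation: clip to [-1, 1], then quantize coarsely.
    half = levels // 2
    return [round(max(-1.0, min(1.0, x)) * half) / half for x in audio]


def run_simulation(num_ticks: int, persona: Persona,
                   noise_level: float = 0.1, seed: int = 0) -> list[list[float]]:
    """Produce one caller-audio frame per tick, however long each step takes to compute."""
    rng = random.Random(seed)
    frames = []
    for tick in range(num_ticks):
        text = generate_text(tick)
        speech = synthesize(text, persona)
        noisy = mix_environment(speech, noise_level, rng)
        frames.append(telephony_degrade(noisy))
    return frames


frames = run_simulation(3, Persona(name="caller", gain=0.8))
simulated_ms = len(frames) * TICK_MS  # simulated time depends only on the tick count
print(len(frames), simulated_ms)
```

Because simulated time is derived from the tick count rather than from a clock, a slow step such as a call to a large LLM changes how long the run takes but not the audio timeline the agent hears, which is the property the abstract describes as decoupling simulation from wall-clock time.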