$\tau$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres

TL;DR

This work introduces $\tau$-Knowledge, an extension of $\tau$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes.

Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $\tau$-Knowledge, an extension of $\tau$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $\tau$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $\tau$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.

Paper Structure (64 sections, 4 figures, 13 tables)

Figures (4)

  • Figure 1: Overview of the $\tau$-Banking domain. Agents must interact with a knowledge base to acquire procedural knowledge, policies, tools, and business offerings in order to resolve complex user requests by invoking discovered tools that modify underlying databases. The example on the right illustrates an agent assisting a user who has lost a wallet containing bank cards: although the user initially requests to freeze the card, card-specific policies and transaction-history constraints require the agent to instead cancel the card.
  • Figure 2: Knowledge-base construction pipeline for $\tau$-Banking. First, a large language model (LLM) expands high-level product/category lists into a structured schema of offerings and typed attributes (e.g., fees, bonuses, limits). Then, the structured records are transformed into natural-language documents (e.g., FAQs and policy articles) that distribute the underlying variables across documents. Finally, during task creation, humans and an LLM review, edit, link, and de-duplicate content to produce the final corpus.
  • Figure 3: (Left) pass^1 to pass^4 reliability for the best-performing configuration of each model declines sharply with increasing $k$, with substantial variation in reliability across systems (e.g., Gemini-3-flash vs. GPT-5.2 (high)). (Right) Pareto frontier of average duration per task versus pass^1 performance highlights substantial differences in solution efficiency.
  • Figure 4: Representative agent failure modes in $\tau$-Bench, grouped into four categories: (1) complex product interdependencies causing incorrect recommendations, (2) violations of required task ordering due to insufficient planning, (3) missing or unverified actions caused by over-trusting user statements, and (4) search inefficiency and unwarranted assumptions during retrieval-driven decision-making.
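
The pass^k metric cited in the abstract and in Figure 3 is not defined in this excerpt. The sketch below assumes the definition used by the original $\tau$-Bench work: the probability that all $k$ independent trials of a task succeed, estimated per task as $\binom{c}{k} / \binom{n}{k}$ from $c$ successes in $n$ trials and then averaged over tasks. The function name `pass_hat_k` and the success counts are hypothetical and purely illustrative; they are not results from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k, assuming the tau-Bench definition.

    successes_per_task: number of successful trials c for each task
    n_trials: number of trials n run per task (same for every task)
    k: number of trials that must all succeed
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # For each task, C(c, k) / C(n, k) is the chance that k trials drawn
    # without replacement from the n recorded trials are all successes.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical success counts for four tasks with 4 trials each.
counts = [4, 2, 0, 1]
curve = [pass_hat_k(counts, n_trials=4, k=k) for k in range(1, 5)]
print(curve)
```

Under this definition pass^1 is the ordinary average success rate, and the estimate can only stay flat or fall as $k$ grows, because a task contributes fully at every $k$ only if it succeeds in all of its trials. That matches the declining curves described for Figure 3 (Left).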