Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan, Shreyas Guha, Ayush Kumar
TL;DR
This work introduces the first domain-grounded benchmark for tool-aware planning in contact-center analytics, built around lineage-guided query decomposition across structured tools (Text2SQL/Snowflake) and unstructured tools (RAG/transcripts). It combines three components: a formal plan representation whose explicit step dependencies enable parallel execution, an iterative Evaluator→Optimizer loop that yields plan lineages, and a dual evaluation framework (metric-wise 0–100 scoring and one-shot plan-to-reference comparison). The benchmark is used to evaluate 14 diverse LLMs. Models struggle with compound, long-horizon plans and with tool-usage completeness, while shorter, simpler plans are consistently easier; lineage prompts and iterative refinement improve executability for some models. The framework offers a reproducible path for assessing and improving agentic planning with tools in production-like data-analysis tasks, and it lays the groundwork for online replanning and broader tool ecosystems.
Abstract
We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers. In our target use case, answering a query for business insights requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts), with explicit depends_on dependencies that enable parallel execution. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes: a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator→optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families, measuring their ability to decompose queries into step-by-step, executable, tool-assigned plans under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5–15 steps); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
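To make the role of depends_on concrete, the following is a minimal sketch of how a plan with explicit dependencies could be grouped into batches of steps that can run in parallel. It is an illustration only, not the paper's plan schema: the Step class, its field names other than depends_on, the tool labels, the example prompts, and the parallel_waves helper are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One plan step (hypothetical schema; only depends_on comes from the paper)."""
    step_id: str
    tool: str      # assumed labels: "T2S" (structured) or "RAG" (unstructured)
    prompt: str    # instruction passed to the tool
    depends_on: list = field(default_factory=list)


def parallel_waves(steps):
    """Group step ids into waves; steps within one wave have no unmet dependencies."""
    remaining = {s.step_id: set(s.depends_on) for s in steps}
    unknown = {d for deps in remaining.values() for d in deps} - set(remaining)
    if unknown:
        raise ValueError(f"depends_on references unknown steps: {sorted(unknown)}")

    waves, done = [], set()
    while remaining:
        # A step is ready once every step it depends on has completed.
        ready = sorted(sid for sid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependencies: plan is not executable")
        waves.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return waves


# Illustrative plan: two independent lookups feed one synthesis step.
plan = [
    Step("s1", "T2S", "Count calls per contact reason for last quarter."),
    Step("s2", "RAG", "Summarize billing complaints raised in transcripts."),
    Step("s3", "RAG", "Relate the complaint themes to the call counts.",
         depends_on=["s1", "s2"]),
]

waves = parallel_waves(plan)
print(waves)
```

In this sketch, s1 and s2 share no dependencies and land in the same wave, so an executor could dispatch them concurrently before running s3. A plan with a dependency cycle is rejected as non-executable.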
