Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Varun Nathan, Shreyas Guha, Ayush Kumar

TL;DR

This work introduces the first domain-grounded benchmark for tool-aware planning in contact-center analytics, emphasizing lineage-guided query decomposition across structured (Text2SQL/Snowflake) and unstructured (RAG/transcripts) tools. It combines a formal plan representation with dependencies to enable parallel execution, an iterative Evaluator→Optimizer loop that yields plan lineages, and a dual evaluation framework (metric-wise 0–100 scoring and one-shot plan-to-reference comparison) applied to 14 diverse LLMs. Findings show that models struggle with compound, long-horizon plans and with tool-usage completeness, though lineage prompts and iterative refinement can improve executability for certain models; shorter, simpler plans are consistently easier. The framework demonstrates a reproducible path for assessing and improving agentic planning with tools in production-like data-analysis tasks, and it lays the groundwork for online replanning and broader tool ecosystems.

Abstract

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers. In our target use case, answering a query for business insights requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts), with explicit depends_on links that enable parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes: a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator→optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
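
To make the role of depends_on concrete, here is a minimal Python sketch of how a plan with explicit dependencies could be grouped into batches of steps that can run in parallel. Only the depends_on field and the T2S/RAG tool names come from the abstract; the Step class, its other field names, the parallel_batches helper, and the example prompts are hypothetical and are not the paper's plan schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    # Hypothetical step record; only `depends_on` is named in the abstract.
    step_id: int
    tool: str            # "T2S" (structured) or "RAG" (unstructured)
    prompt: str
    depends_on: List[int] = field(default_factory=list)


def parallel_batches(plan: List[Step]) -> List[List[int]]:
    """Group step ids into batches; every step in a batch has all of its
    dependencies satisfied by earlier batches, so a batch can run in parallel."""
    remaining = {s.step_id: set(s.depends_on) for s in plan}
    done = set()
    batches = []
    while remaining:
        ready = sorted(sid for sid, deps in remaining.items() if deps <= done)
        if not ready:
            # No step is runnable: a cycle or a reference to an unknown step.
            raise ValueError("plan has a cycle or a missing dependency")
        batches.append(ready)
        done.update(ready)
        for sid in ready:
            del remaining[sid]
    return batches


# Illustrative plan (made-up prompts, not taken from the benchmark).
plan = [
    Step(1, "T2S", "Count billing-complaint calls per week last quarter"),
    Step(2, "RAG", "Summarize reasons customers give for billing complaints"),
    Step(3, "T2S", "Average handle time for the calls from step 1", depends_on=[1]),
    Step(4, "RAG", "Find transcript excerpts for the longest calls", depends_on=[2, 3]),
]

batches = parallel_batches(plan)
```

In this sketch, steps 1 and 2 have no dependencies and land in the same batch, while steps 3 and 4 wait for their inputs. A plan whose dependencies form a cycle is rejected instead of being silently reordered.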

Paper Structure

This paper contains 172 sections, 1 theorem, 11 equations, 1 figure, 51 tables, 1 algorithm.

Key Result

Lemma 1

The feedback loop in the plan-optimizer algorithm terminates in at most $M$ passes. Within each pass, the inner scan either (i) advances the step index $i$ or (ii) applies a finite structural change to the plan that is immediately recorded and re-checked; the outer guard $\text{pass} < M$ guarantees terminat

Figures (1)

  • Figure 1: Iterative step-wise evaluator $\rightarrow$ plan optimizer loop producing a plan lineage. For each pass, every step $i$ of the current plan $P$ is diagnosed by the step-wise evaluator and optionally edited by the plan optimizer. Any updated plan $P'$ is appended to the lineage database. The loop stops when a full pass yields no changes or when the maximum number of passes is reached, and the final plan $P^\star$ is human-verified.
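
The loop in Figure 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's algorithm: a plan is reduced to a list of step strings, evaluate and optimize are hypothetical callables standing in for the step-wise evaluator and the plan optimizer, and a changed step is re-checked on the next pass rather than immediately. The final human verification step is not modeled.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]  # simplification: each step is just its prompt text


def refine(
    plan: Plan,
    evaluate: Callable[[Plan, int], Optional[str]],  # diagnosis, or None if step is fine
    optimize: Callable[[Plan, int, str], Plan],      # returns a (possibly) edited plan
    max_passes: int,
) -> Tuple[Plan, List[Plan]]:
    """Run evaluator -> optimizer passes; return the final plan and its lineage."""
    current = list(plan)
    lineage = [list(current)]          # lineage starts with the initial plan
    passes = 0
    while passes < max_passes:         # outer guard bounds the number of passes
        changed = False
        i = 0
        while i < len(current):
            diagnosis = evaluate(current, i)
            if diagnosis is not None:
                updated = optimize(current, i, diagnosis)
                if updated != current:
                    current = updated
                    lineage.append(list(current))  # record every revision
                    changed = True
            i += 1                     # the scan always advances
        passes += 1
        if not changed:                # a full pass with no edits: stop early
            break
    return current, lineage


# Toy evaluator/optimizer pair (hypothetical): flag vague wording and replace it.
def toy_evaluate(p: Plan, i: int) -> Optional[str]:
    return "vague" if "stuff" in p[i] else None


def toy_optimize(p: Plan, i: int, diagnosis: str) -> Plan:
    edited = list(p)
    edited[i] = edited[i].replace("stuff", "billing complaints")
    return edited


final_plan, lineage = refine(
    ["Count stuff per week", "Summarize transcripts about stuff"],
    toy_evaluate,
    toy_optimize,
    max_passes=5,
)
```

Because the inner scan always advances and the outer loop is bounded by max_passes, the sketch stops even when the optimizer keeps proposing edits, which mirrors the two stopping conditions in the caption.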

Theorems & Definitions (2)

  • Lemma 1: Termination
  • Proof: Proof sketch