TAI3: Testing Agent Integrity in Interpreting User Intent

Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Yusuf Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang

TL;DR

TAI3 tackles the problem of intent integrity in LLM agents that interpret user instructions to invoke API toolkits. It semantically partitions API parameters into VALID, INVALID, and UNDERSPEC categories, then applies intent-preserving mutations that are ranked by a lightweight error predictor and accelerated by a datatype-aware strategy memory of previously effective mutation patterns. Evaluations over 80 toolkit APIs across five domains show that TAI3 outperforms baselines in both error-exposing rate and query efficiency, remains effective when smaller LLMs generate tests for stronger target models, and adapts to evolving APIs. The framework provides a scalable, API-centric approach to validating that agents preserve user intent, with practical implications for safety and reliability in real-world AI automation.

Abstract

LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often misinterpret user intent, so that their actions diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce TAI3, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, TAI3 generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, TAI3 maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that TAI3 effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, TAI3 generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.
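
To make the two central ideas more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a parameter-partition form can be represented as a mapping from each API parameter to its equivalence classes under the VALID, INVALID, and UNDERSPEC categories, and that the error predictor is any function scoring a task string. The parameter `duration_minutes`, its classes, and `toy_predictor` are hypothetical stand-ins chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Category(Enum):
    VALID = "VALID"
    INVALID = "INVALID"
    UNDERSPEC = "UNDERSPEC"


@dataclass(frozen=True)
class Partition:
    parameter: str          # API parameter this partition targets
    category: Category      # VALID, INVALID, or UNDERSPEC
    equivalence_class: str  # human-readable description of the class


def build_partitions(
    param_classes: Dict[str, Dict[Category, List[str]]]
) -> List[Partition]:
    """Flatten {parameter: {category: [class, ...]}} into a list of partitions."""
    return [
        Partition(param, category, cls)
        for param, by_category in param_classes.items()
        for category, classes in by_category.items()
        for cls in classes
    ]


def rank_mutants(mutants: List[str], predictor: Callable[[str], float]) -> List[str]:
    """Order mutated tasks so the likeliest error triggers are executed first."""
    return sorted(mutants, key=predictor, reverse=True)


# Hypothetical parameter and equivalence classes (assumption, not from the paper).
PARAM_CLASSES = {
    "duration_minutes": {
        Category.VALID: ["positive integer", "duration stated in hours"],
        Category.INVALID: ["negative number"],
        Category.UNDERSPEC: ["no duration mentioned"],
    },
}


def toy_predictor(task: str) -> float:
    """Stand-in score that counts vague phrases; NOT the paper's predictor."""
    vague_phrases = ("a while", "later", "soon")
    return float(sum(task.lower().count(p) for p in vague_phrases))


partitions = build_partitions(PARAM_CLASSES)
ranked = rank_mutants(
    [
        "Unlock the front door for 30 minutes.",
        "Unlock the front door for a while and lock it later.",
    ],
    toy_predictor,
)
```

In this sketch each partition would receive its own seed task, and only the top-ranked mutants would be sent to the target agent, which is one plausible way ranking could reduce the number of agent queries.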

Paper Structure

This paper contains 27 sections, 2 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: An example where the agent misinterprets user intent. Our proposed TAI3 aims to uncover such cases in a systematic and strategic way.
  • Figure 2: Motivating Example. (a) Documentation for an API from a smart lock toolkit. (b) Three examples of intent integrity violations (API call traces omitted for brevity). (c) A simplified parameter-partition form of the API, showing 3 categories and 14 equivalence classes.
  • Figure 3: Overview of TAI3. Stage 1 constructs a parameter-partition form via semantic partitioning and generates seed tasks for each partition. Stage 2 performs intent-preserving mutation (enhanced by retrieving relevant past strategies), ranks mutated tasks by error likelihood, executes the target agent, and updates the strategy memory when novel strategies are found.
  • Figure 4: How TAI3 mutates a seed task to reveal errors in an agent. It iteratively produces new variants that preserve the original user intent while increasing the likelihood of inducing an agent error. In this way, TAI3 prioritizes the tasks that are most likely to induce an error.
  • Figure 5: TAI3 ranks error-triggering tasks higher, leading to consistently better $\mathtt{EESR}\uparrow$.
  • ...and 11 more figures