ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

Abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

Paper Structure

This paper contains 15 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of ATBench. Left: a three-dimensional taxonomy of unsafe agent trajectories. Center: trajectory-level safety judgment from a complete interaction trace. Right: an unsafe example where hidden prompt injection leads the agent to skip verification and post an unverified social-media summary to Discord.
  • Figure 2: Data generation engine for synthesizing multi-step agent trajectories in ATBench. Given a sampled risk and candidate tools, a planner produces a trajectory blueprint that is then executed sequentially through query generation, risk injection, tool call and response simulation, and agent response generation. A validation layer combines rule-based and LLM-based filtering to ensure realism.
  • Figure 3: Representative case studies for failure-mode misidentification (a) and risk-source misattribution (b). In both cases all four evaluators correctly detect the trajectory as unsafe but fail to recover the fine-grained diagnostic label.
  • Figure 4: Cross-benchmark comparison of model performance on representative agent-safety benchmarks and ATBench. For most representative models, performance is lower on ATBench, indicating higher overall difficulty.
  • Figure 5: Category-wise accuracy on the fine-grained ATBench taxonomy. Accuracies are computed only over unsafe trajectories belonging to the corresponding leaf category. We compare AgentDoG-Qwen3-4B, GPT-5.4, Qwen3.5-397B, and Llama3.1-8B.
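The data generation engine described in Figure 2 can be sketched as a planner that emits a trajectory blueprint, a sequential executor for the per-turn stages (query generation, risk injection, tool call and response simulation, agent response generation), and a validation layer combining rule-based and LLM-based filtering. Every function below is a hypothetical placeholder standing in for an LLM or simulator call; none of the names are from the paper.

```python
def plan_blueprint(risk, candidate_tools):
    # Placeholder planner: given a sampled risk and candidate tools,
    # lay out the stages each trajectory will pass through.
    return {"risk": risk, "tools": candidate_tools,
            "stages": ["query", "risk_injection", "tool_sim", "agent_response"]}

def run_stage(stage, state):
    # Placeholder executor: record that the stage ran; a real engine
    # would invoke an LLM or a tool simulator here.
    state["trace"].append(stage)
    return state

def rule_filter(state):
    # Example rule-based check: every planned stage must appear in the trace.
    return all(s in state["trace"] for s in state["blueprint"]["stages"])

def generate_trajectory(risk, candidate_tools, llm_judge=lambda s: True):
    """Plan, execute stages sequentially, then validate the result."""
    blueprint = plan_blueprint(risk, candidate_tools)
    state = {"blueprint": blueprint, "trace": []}
    for stage in blueprint["stages"]:
        state = run_stage(stage, state)
    # Validation layer: rule-based filter, then an LLM-based realism judge.
    accepted = rule_filter(state) and llm_judge(state)
    return state, accepted
```

Trajectories rejected by either filter would be discarded or regenerated; in ATBench this automated layer is followed by a full human audit.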