ABTest: Behavior-Driven Testing for AI Coding Agents

Wuyang Dai, Moses Openja, Hung Viet Pham, Gias Uddin, Jinqiu Yang, Song Wang

Abstract

AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors. We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported, developer-confirmed agent failures, we extract 47 Interaction Patterns and 128 Action types, generating 647 repository-grounded fuzzing cases. Executing the 647-case bundle once per evaluated configuration, ABTest flags 1,573 behavioral anomalies across the three coding agent families, of which 642 are manually confirmed as new true anomalies, achieving a detection precision of 40.8%. Our results demonstrate that ABTest effectively uncovers real-world failures, exposes robustness differences across models, and reveals previously unreported failure modes.
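
The five numbered stages above form a pipeline. As a minimal sketch of that control flow, the following Python outline may help; every name in it (InteractionPattern, ActionType, FuzzCase, the stage functions, and the toy anomaly oracle) is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of the five-stage ABTest pipeline; names and data
# shapes are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class InteractionPattern:      # reusable multi-step workflow mined from a report
    name: str
    steps: list[str]

@dataclass
class ActionType:              # concrete agent behavior a pattern step exercises
    name: str

@dataclass
class FuzzCase:                # executable test case grounded in a real repository
    pattern: InteractionPattern
    action: ActionType
    repo: str
    prompts: list[str] = field(default_factory=list)

def mine(reports: list[dict]) -> tuple[list[InteractionPattern], list[ActionType]]:
    """Stage 1: derive Interaction Patterns and Action types from failure reports
    (assumed report keys: 'pattern', 'steps', 'actions')."""
    patterns = [InteractionPattern(r["pattern"], r["steps"]) for r in reports]
    actions = [ActionType(a) for r in reports for a in r["actions"]]
    return patterns, actions

def compose(patterns, actions):
    """Stage 2: pair patterns and actions into stepwise fuzzing templates
    (the paper's compatibility check is omitted here)."""
    return [(p, a) for p in patterns for a in actions]

def instantiate(templates, repo: str) -> list[FuzzCase]:
    """Stage 3: ground each template in a concrete repository."""
    return [FuzzCase(p, a, repo, prompts=[f"{s} ({a.name})" for s in p.steps])
            for p, a in templates]

def execute(case: FuzzCase, agent) -> dict:
    """Stage 4: drive the coding agent step by step, recording the trace."""
    return {"case": case, "trace": [agent(prompt) for prompt in case.prompts]}

def detect(run: dict) -> bool:
    """Stage 5: flag anomalous behavior in the recorded trace (toy oracle)."""
    return any("error" in step.lower() for step in run["trace"])
```

In the paper, stage 5 is far richer than this toy oracle, combining trace and artifact checks with manual validation of flagged anomalies; the sketch only fixes the staging of the pipeline.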

Paper Structure

This paper contains 24 sections, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: The overview of ABTest
  • Figure 2: Simplified report of Gemini CLI issue #4586: the issue describes Gemini CLI "losing" files during an in-folder file organization request and includes an attached transcript of the full PowerShell session.
  • Figure 3: Transcript excerpt from Gemini CLI issue #4586, preserving the original sequence of claims and checks from the run trace.
  • Figure 4: Transcript excerpt from Gemini CLI issue #4586, preserving the original loss-claim wording from the run trace.
  • Figure 5: Seed template example formed from a compatible Interaction Pattern–Action Type pair, shown as the original compact JSON artifact used by the pipeline (a hypothetical sketch of such a template follows this list).
  • ...and 4 more figures
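
Figure 5's compact JSON artifact is not reproduced in this excerpt. Purely as a hedged illustration of what a seed template pairing an Interaction Pattern with an Action type might contain, a sketch follows; every field name is an assumption, not the paper's actual schema.

```python
import json

# Hypothetical seed template: one Interaction Pattern paired with one
# compatible Action type, serialized as a compact JSON artifact.
# Field names are illustrative assumptions; the paper's schema may differ.
seed_template = {
    "interaction_pattern": {
        "id": "IP-organize-then-verify",   # assumed identifier scheme
        "steps": [
            "ask the agent to reorganize files in a folder",
            "follow up or interrupt mid-operation",
            "ask the agent to verify the resulting file set",
        ],
    },
    "action_type": {"id": "AT-file-move", "tool": "filesystem"},
    "grounding": {"repo": "<target repository>", "seed_dir": "<folder>"},
}

print(json.dumps(seed_template, indent=2))
```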