
ProbeLLM: Automating Principled Diagnosis of LLM Failures

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang

TL;DR

ProbeLLM is a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes, supporting a shift from case-centric evaluation toward principled weakness discovery.

Abstract

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

Paper Structure

This paper contains 41 sections, 3 theorems, 17 equations, 16 figures, 11 tables, 1 algorithm.

Key Result

Theorem 5.3

Given an initial failure $x_0$, the Micro probing strategy samples $n$ independent semantic variants in $B(x_0, \epsilon)$ and identifies the failure mode with probability $P_{succ}$. As $n \to \infty$, $P_{succ}$ converges to 1 exponentially.
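
To make the exponential rate concrete, here is a minimal sketch under a simplifying assumption that is not taken from the paper: each of the $n$ variants independently exposes the failure mode with the same fixed probability `p`. Under that assumption the success probability is $1 - (1 - p)^n$, which is bounded below by $1 - e^{-pn}$ because $1 - p \le e^{-p}$. The theorem's actual bound may take a different form; the function names and the value of `p` are hypothetical.

```python
import math


def success_probability(p: float, n: int) -> float:
    """P(at least one of n independent variants exposes the failure mode).

    Assumption (not from the paper): every variant hits with the same
    probability p, independently of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def exponential_lower_bound(p: float, n: int) -> float:
    """Lower bound 1 - exp(-p * n), which follows from 1 - p <= exp(-p)."""
    return 1.0 - math.exp(-p * n)


if __name__ == "__main__":
    p = 0.2  # hypothetical per-variant hit rate
    for n in (1, 5, 10, 20, 40):
        exact = success_probability(p, n)
        bound = exponential_lower_bound(p, n)
        print(f"n={n:2d}  exact={exact:.4f}  lower_bound={bound:.4f}")
```

The gap between the failure probability and zero shrinks geometrically in $n$, which is what "converges to 1 exponentially" means in this simplified setting.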

Figures (16)

  • Figure 1: Examples of individual failure cases and the corresponding failure modes that capture recurring error patterns.
  • Figure 2: Differences between the Macro and Micro search strategies. Macro aims to diversify the search topics, while Micro aims to enhance local exploration.
  • Figure 3: Overview of ProbeLLM. (I) ProbeLLM probes a target model using an LLM-based generator and initializes the search with seed test cases from an existing benchmark. (II) At Level 1, the hierarchical search selects between Macro and Micro regimes. Conditioned on the selected regime, ProbeLLM performs tool-augmented generation to propose new test cases and verifies the target model responses. (III) From the collected failure cases, ProbeLLM computes failure-aware embeddings, clusters failures, and applies boundary-aware induction to produce interpretable failure modes.
  • Figure 4: Error rates at different search depths across five datasets, with standard deviations computed over results from 12 target models.
  • Figure 5: Tool usage distribution for question generation (top pie) and answer generation (bottom pie), along with error rates across tools (right bar chart).
  • ...and 11 more figures
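
Figure 3 describes a Level 1 step that selects between the Macro and Micro regimes. As a rough illustration only, the sketch below treats the two regimes as arms of a bandit and chooses between them with the UCB1 rule, a common selection rule in Monte Carlo Tree Search. This summary does not state which selection rule ProbeLLM uses, so UCB1, the exploration constant `c`, the simulated failure rates, and all function names are assumptions.

```python
import math
import random


def ucb1_select(counts: dict[str, int], rewards: dict[str, float], c: float = 1.4) -> str:
    """Return the regime with the highest UCB1 score.

    Regimes that have never been tried are returned first so that every
    regime has at least one observation before scores are compared.
    """
    for arm, pulls in counts.items():
        if pulls == 0:
            return arm
    total = sum(counts.values())

    def score(arm: str) -> float:
        mean = rewards[arm] / counts[arm]                       # exploitation term
        bonus = c * math.sqrt(math.log(total) / counts[arm])    # exploration term
        return mean + bonus

    return max(counts, key=score)


def run_probe(budget: int, failure_rate: dict[str, float], seed: int = 0) -> dict[str, int]:
    """Spend a probing budget across regimes; return how often each was chosen.

    failure_rate is a hypothetical stand-in for the real generate-and-verify
    loop: a probe in a regime yields a verified failure with that probability.
    """
    rng = random.Random(seed)
    counts = {arm: 0 for arm in failure_rate}
    rewards = {arm: 0.0 for arm in failure_rate}
    for _ in range(budget):
        arm = ucb1_select(counts, rewards)
        reward = 1.0 if rng.random() < failure_rate[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts


if __name__ == "__main__":
    print(run_probe(200, {"macro": 0.1, "micro": 0.6}))
```

The exploration bonus keeps some budget flowing to the less rewarding regime, which mirrors the trade-off between global exploration and local refinement described in the abstract.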

Theorems & Definitions (7)

  • Definition 5.1: Failure Mode
  • Theorem 5.3: Mode Identification Convergence
  • proof
  • Definition 5.4: Cumulative Regret
  • Lemma 5.5: Node-Level Regret Bound
  • Theorem 5.6: Search Convergence
  • proof
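
The list above names Definition 5.4 (Cumulative Regret) without giving its statement. For orientation, the standard bandit-style form of cumulative regret is sketched below. It is an assumption that the paper's definition follows this shape; the symbols $T$, $a_t$, $\mu_a$, and $\mu^*$ are introduced here for illustration and are not taken from the source.

$$
R_T \;=\; T\,\mu^* \;-\; \sum_{t=1}^{T} \mu_{a_t},
\qquad \mu^* = \max_{a} \mu_a ,
$$

where $a_t$ is the node (or regime) selected at step $t$ and $\mu_a$ is its expected reward, for example the chance that a probe in that node yields a verified failure. A search procedure is usually said to converge when $R_T / T \to 0$ as $T \to \infty$.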