Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

TL;DR

AuditDM introduces an RL-based MLLM auditor that actively discovers capability gaps by generating failure-inducing question–image pairs that maximize cross-model disagreement. The framework yields annotation-free, targeted data for rectification and improves Gemma-3 and PaliGemma-2 across 16 benchmarks, in one case enabling a 3B model to surpass its 28B counterpart. By focusing on interpretable failure modes and a closed-loop improvement cycle, AuditDM addresses the diminishing returns of data scaling alone and offers a scalable path to continual MLLM improvement. The work highlights the practical value of model auditing as a diagnostic and corrective tool in multimodal AI systems.

Abstract

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
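
As a rough illustration of the disagreement signal described in the abstract, the sketch below scores a single generated question–image pair from the answers it elicits. It is not the paper's reward: the function name `disagreement_reward`, the exact-match comparison of normalized strings, and the use of a majority vote over a reference ensemble are all assumptions made for this example.

```python
from collections import Counter


def disagreement_reward(target_answer: str, ensemble_answers: list[str]) -> float:
    """Toy reward (hypothetical): high when the reference ensemble agrees
    with itself but the target model gives a different answer."""
    if not ensemble_answers:
        return 0.0
    # Assumption: answers are short strings compared after simple normalization.
    normalized = [a.strip().lower() for a in ensemble_answers]
    majority, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)  # share of the ensemble backing the majority
    if target_answer.strip().lower() == majority:
        return 0.0  # target matches the consensus, so no gap is exposed
    return agreement  # target deviates; reward grows with ensemble consensus


if __name__ == "__main__":
    print(disagreement_reward("red", ["blue", "blue", "blue"]))
    print(disagreement_reward("blue", ["blue", "blue", "red"]))
```

Under these assumptions, a pair earns the most reward when every reference model gives the same answer and the target gives a different one, which is the kind of example an auditor would be pushed to generate.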

Paper Structure

This paper contains 27 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of AuditDM. We propose to train an auditor model to systematically discover capability gaps in an MLLM by generating failure-inducing question–image pairs. We show three automatically generated examples of weaknesses in object relationships. The proposed framework offers diagnostic insight and enables targeted rectification via auditor-guided feedback.
  • Figure 2: Model improvement with AuditDM. We report average performance over all benchmarks per model (excluding MME due to its incompatible score scale). Once trained, AuditDM generates targeted, large-scale data points aligned with discovered weaknesses, training on which can produce consistent gains across diverse models and benchmarks.
  • Figure 3: AuditDM architecture. AuditDM fine-tunes an MLLM into an auditor that generates challenging probing questions and counterfactual images (via captions for image regeneration or editing commands), yielding question–image pairs on which the target model fails while the MLLM ensemble agrees, thus exposing capability gaps and failure modes. The auditor is trained to maximize prediction discrepancy between the target and the ensemble. Once trained, it identifies weaknesses and failure cases in a single inference pass.
  • Figure 4: AuditDM identifies the top 15 failure modes and challenging task categories for PaliGemma2‑3B and 28B models at 448px$^2$, and we report normalized per-category failure rates. Tasks are ordered left to right, beginning with the most pronounced weaknesses of the 3B model and progressing to those of the 28B. Notably, we observe that for certain tasks, the 28B model performs significantly worse than the 3B model. For example, on challenging images, the 28B model struggles more with color recognition and counting, and is more prone to hallucination.
  • Figure 5: Generated examples for each failure category. To better demonstrate the effectiveness, we focus on examples with original images and generated questions. Image-question pairs with both generated images and questions are provided in Fig. \ref{fig:model_edit}. Some images are cropped or rotated for better figure layout. Original images and additional examples are provided in Sec. \ref{sec:supp:qua}.
  • ...and 4 more figures
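
The Figure 4 caption mentions normalized per-category failure rates without stating the normalization. The sketch below shows one plausible reading, assuming each audited example carries a category label and a pass/fail flag, and that rates are divided by the largest per-category rate so the weakest category scores 1.0. The function name and the choice of normalization are hypothetical and may differ from what the authors used.

```python
from collections import defaultdict


def normalized_failure_rates(records):
    """records: iterable of (category, failed) pairs.
    Returns {category: failure_rate / max_failure_rate} (assumed normalization)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in records:
        totals[category] += 1
        failures[category] += int(failed)
    rates = {c: failures[c] / totals[c] for c in totals}
    peak = max(rates.values(), default=0.0)
    if peak == 0:
        return {c: 0.0 for c in rates}  # no failures recorded anywhere
    return {c: r / peak for c, r in rates.items()}


if __name__ == "__main__":
    audit_log = [
        ("counting", True), ("counting", True),
        ("color", True), ("color", False),
    ]
    rates = normalized_failure_rates(audit_log)
    # Order categories from most to least pronounced weakness.
    for category, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
        print(category, rate)
```

Sorting the resulting dictionary by value gives a ranking of categories from the most to the least pronounced weakness, similar in spirit to the left-to-right ordering the caption describes.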