CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut
TL;DR
CAVE introduces a real-world visual anomaly benchmark with 361 images and 334 anomalies to evaluate Vision-Language Models (VLMs) on anomaly detection, description, explanation, and justification, as well as localization via bounding boxes. The framework is grounded in cognitive science: it separates perception, understanding, and manifestation, and annotates anomalies by severity, surprisal, and complexity across six manifestation categories. Empirical results show that current state-of-the-art models, including GPT-4o, struggle to reach high accuracy on anomaly detection and localization. Explanation is comparatively easier, but justification lags behind human performance and often lacks creativity. The work highlights the gap between synthetic benchmarks and real-world anomalies, identifies cultural bias as a challenge, and points to fine-grained visual representations and retrieval-based commonsense knowledge as directions for advancing anomaly detection and reasoning in VLMs.
Abstract
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, work on this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, and fails to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification) and provides fine-grained annotations for visual grounding and for categorizing anomalies by their visual manifestation, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
