C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Car Damage Detection
Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante
TL;DR
This paper tackles fine-grained vehicle-damage detection by addressing the limitation of local feature conditioning in diffusion-based detectors. It introduces C-DiffDet+, a context-aware diffusion framework that fuses global scene context with local proposals through a Global Context Encoder (GCE) and Context-Aware Fusion (CAF), augmented by Adaptive Channel Enhancement (ACE) and enhanced Multi-Modal Fusion (MMF). The approach yields state-of-the-art results on the CarDD benchmark, with notable gains for small and hard-to-discriminate damages such as cracks and scratches, and achieves stronger localization through an iterative diffusion head conditioned on scene context. The method also generalizes to another automotive-damage dataset (VehiDE), and the paper provides detailed ablations and convergence analyses, underscoring the practical impact of context-aware diffusion for robust fine-grained detection in real-world conditions.
Abstract
Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, is difficult even for human experts to perform reliably. While DiffusionDet has advanced the state of the art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this limitation by introducing Context-Aware Fusion (CAF), which uses cross-attention to integrate global scene context directly with local proposal features. The global context is generated by a separate, dedicated encoder that captures comprehensive environmental information, so that each object proposal can attend to scene-level understanding, which significantly enhances the generative detection paradigm. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
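
As an illustration of the cross-attention idea behind CAF, the following is a minimal sketch rather than the authors' implementation. It assumes single-head attention with no learned projections, a residual connection, and toy two-dimensional features; the function name `cross_attention_fuse` and the example values are hypothetical. Each proposal feature acts as a query over a set of global context tokens, and the attended context is added back to the proposal.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_fuse(proposals, context):
    """Fuse local proposal features with global context tokens.

    proposals: N x d nested list (queries, one row per object proposal)
    context:   M x d nested list (keys and values, one row per context token)
    Returns (fused, weights): fused is N x d, weights is N x M.
    Assumption: identity projections for Q, K, V (a real model would learn them).
    """
    d = len(proposals[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in proposals:
        # Scaled dot-product score between this proposal and every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        w = softmax(scores)
        # Weighted sum of context tokens (values), one entry per feature channel.
        attended = [sum(w[j] * context[j][c] for j in range(len(context)))
                    for c in range(d)]
        # Residual connection keeps the original local feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return fused, weights

# Toy example: two proposals, three global context tokens.
proposals = [[1.0, 0.0], [0.0, 1.0]]
context = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attention_fuse(proposals, context)
```

In this toy setup each proposal places the most attention on the context token most aligned with it, while the residual term preserves the local feature that the diffusion head would otherwise see in isolation.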
