C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Car Damage Detection

Abdellah Zakaria Sellam, Ilyes Benaissa, Salah Eddine Bekhouche, Abdenour Hadid, Vito Renó, Cosimo Distante

TL;DR

This paper tackles the challenge of fine-grained vehicle-damage detection by addressing the limitation of local feature conditioning in diffusion-based detectors. It introduces C-DiffDet+, a context-aware diffusion framework that fuses global scene context with local proposals through a Global Context Encoder (GCE) and Context-Aware Fusion (CAF), augmented by Adaptive Channel Enhancement (ACE) and enhanced Multi-Modal Fusion (MMF). The approach yields state-of-the-art results on the CarDD benchmark, with notable gains for small and hard-to-discriminate damages such as cracks and scratches, and demonstrates stronger localization through an iterative diffusion head conditioned on scene context. The method also shows generalization to another automotive-damage dataset (VehiDE) and provides detailed ablations and convergence analyses, underscoring the practical impact of context-aware diffusion for robust fine-grained detection in real-world conditions.

Abstract

Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, is difficult even for human experts to perform reliably. While DiffusionDet has advanced the state of the art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this limitation by introducing Context-Aware Fusion (CAF), which uses cross-attention to integrate global scene context directly with local proposal features. The global context is generated by a separate, dedicated encoder that captures scene-level environmental information, so that each object proposal can attend to the scene as a whole. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.
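
The cross-attention idea behind CAF can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `context_aware_fusion`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all tensor sizes are assumptions chosen for illustration. Proposal features act as queries, and global context tokens act as keys and values, so each proposal receives a weighted summary of the scene.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def context_aware_fusion(proposals, context, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention fusion.

    proposals: (N, d) local proposal features (queries).
    context:   (M, d) global scene-context tokens (keys and values).
    Returns the fused features (N, d) and the attention map (N, M).
    """
    q = proposals @ w_q                        # (N, d)
    k = context @ w_k                          # (M, d)
    v = context @ w_v                          # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot product, (N, M)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    # Assumption: a residual connection keeps the original local features.
    fused = proposals + attn @ v
    return fused, attn


rng = np.random.default_rng(0)
N, M, d = 4, 6, 8                              # illustrative sizes only
proposals = rng.standard_normal((N, d))
context = rng.standard_normal((M, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = context_aware_fusion(proposals, context, w_q, w_k, w_v)
print(fused.shape, attn.shape)                 # (4, 8) (4, 6)
```

Each row of `attn` describes how strongly one proposal attends to each context token. In a trained model the projections would be learned, and the context tokens would come from the dedicated global encoder rather than random values.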

Paper Structure

This paper contains 42 sections, 22 equations, 6 figures, 6 tables, 6 algorithms.

Figures (6)

  • Figure 1: Overview of the proposed Context-Aware DiffusionDet architecture. The framework consists of four key components: (1) Adaptive Channel Enhancement (ACE) blocks that enhance backbone and FPN features, (2) Global Context Encoder (GCE) for comprehensive scene understanding, (3) Context-Aware Fusion (CAF) that integrates global context with local features through cross-attention, and (4) enhanced Multi-Modal Fusion (MMF) with global context embeddings. This architecture addresses the local feature conditioning limitation in existing diffusion-based detectors by enabling the comprehensive integration of environmental context.
  • Figure 2: This figure compares the performance of our model against the baseline. The first row contains the original images with the ground truth annotations. The second row shows the bounding boxes generated by DiffusionDet. The third row shows the more accurate bounding boxes produced by our enhanced model.
  • Figure 3: Visual comparison of two object detection models, DiffusionDet and our model, for identifying car damage on the CarDD dataset. The top row shows the original images of damaged cars. The middle row shows the heatmaps generated by DiffusionDet, highlighting the areas it focuses on to detect damage. The bottom row shows the heatmaps from our model, which focus more tightly and accurately on the damaged regions.
  • Figure 4: Training loss curves comparing baseline DiffusionDet (blue) and C-DiffDet+ (orange) over 20,000 iterations.
  • Figure 5: Qualitative comparison between ground truth annotations in the CarDD dataset and our model’s predictions, illustrating both detection errors and inconsistencies in the dataset annotations that limit precise evaluation.
  • ...and 1 more figure