Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection
Sen Nie, Zhuo Wang, Xinxin Wang, Kun He
TL;DR
This work tackles the challenge of increasing dataset diversity for object detection without sacrificing semantic coordination. It introduces a diffusion-based data augmentation framework with three components: a Category Affinity Matrix derived from CLIP-based embeddings to guide inter-class diversity, a Surrounding Region Alignment strategy to preserve global semantic coherence during object edits via DDIM inversion and text-conditioned editing, and an instance-level filter to ensure quality. Empirically, the method yields substantial improvements across Faster R-CNN, Mask R-CNN, and YOLOX on standard benchmarks, with average gains of +1.4AP, +0.9AP, and +3.4AP over competitive baselines, and notable gains on category-specific and fine-grained datasets. These results demonstrate that diffusion-based augmentation can simultaneously enhance diversity and semantic coordination, offering practical benefits for robust object detection across diverse data regimes.
Abstract
Recent studies emphasize the crucial role of data augmentation in enhancing the performance of object detection models. However, existing methods often struggle to balance dataset diversity with semantic coordination. To bridge this gap, we introduce an augmentation technique that leverages pre-trained conditional diffusion models to strike this balance. Our approach comprises a Category Affinity Matrix, designed to enhance dataset diversity, and a Surrounding Region Alignment strategy, which preserves semantic coordination in the augmented images. Extensive experiments confirm that our method enriches dataset diversity while maintaining semantic coordination. Our method yields substantial average improvements of +1.4AP, +0.9AP, and +3.4AP over existing alternatives on three distinct object detection models, respectively.
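
To make the idea of a category affinity matrix concrete, the following is a minimal sketch and not the authors' implementation. The summary above states only that the matrix is derived from CLIP-based embeddings; the use of cosine similarity, the toy three-dimensional vectors standing in for real CLIP embeddings, and the helper names `affinity_matrix` and `closest_categories` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_matrix(embeddings):
    """Pairwise similarity between category embeddings (hypothetical helper).

    `embeddings` maps a category name to its embedding vector. Cosine
    similarity is an assumed choice; the source does not specify the measure.
    """
    names = list(embeddings)
    return {a: {b: cosine(embeddings[a], embeddings[b]) for b in names} for a in names}


def closest_categories(matrix, source, k=1):
    """Return the k categories most similar to `source`, excluding itself."""
    others = [(name, score) for name, score in matrix[source].items() if name != source]
    others.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in others[:k]]


# Toy vectors standing in for CLIP embeddings (illustrative values only).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

matrix = affinity_matrix(toy_embeddings)
print(closest_categories(matrix, "cat"))
```

In a sketch like this, a high-affinity category could serve as a candidate replacement label when editing an object, on the assumption that similar categories are more likely to fit the surrounding scene. How the actual method uses the matrix is not described in this section.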
