Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection

Jens Petersen, Davide Abati, Amirhossein Habibian, Auke Wiggers

TL;DR

This work addresses a gap in generative data augmentation for automotive object detection: existing methods pay little attention to where objects are placed in the scene. It introduces a scene-aware probabilistic location model that proposes realistic positions for new objects, conditioned on scene depth and drivable space. A diffusion-based inpainting system with a lightweight mask decoder then renders each object and generates its instance mask, producing augmented frames that are both realistic and diverse. The approach sets a new state of the art on nuImages and BDD100K, with up to $2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$ mAP boost), and it also improves instance segmentation substantially. Extensive ablations give insight into finetuning, masking, and placement realism. The results underscore the practical value of jointly modeling layout and appearance for data augmentation in real-world driving scenarios, though limitations remain regarding scene diversity and dependence on auxiliary perception modules.

Abstract

Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene-aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to $2.8\times$ higher gains than the best competing approach ($+1.4$ vs. $+0.5$ mAP boost). We also demonstrate significant improvements for instance segmentation.
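
As a quick check of the headline figure, the $2.8\times$ factor can be read as a ratio of the two mAP boosts quoted in the abstract, not a ratio of absolute mAP values:

$$\frac{\Delta\mathrm{mAP}_{\text{proposed}}}{\Delta\mathrm{mAP}_{\text{best competing}}} = \frac{1.4}{0.5} = 2.8$$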

Paper Structure

This paper contains 38 sections, 1 equation, 18 figures, 5 tables.

Figures (18)

  • Figure 1: An original scene (top left) and three augmented frames using different location modeling and augmentation strategies. Generated objects are indicated by green bounding boxes. Our approach proposes locations that fit the original scene, resulting in novel compositions with high visual realism and challenging occlusion cases. Approaches that reuse original locations, even with minor modifications such as in GeoDiffusion, generate frames with visual appearance diversity but limited location diversity. Approaches that add objects in random locations, such as X-Paste, disregard the realism of the resulting layout and, in turn, of the generated frames.
  • Figure 2: Overview of our augmentation pipeline. (A) We first use the location model to predict realistic bounding box locations for new objects, using depth and drivable space segmentation. (B) We then generate an object and corresponding instance mask using an inpainting model. (C) This allows us to create pseudo-labels for object detection and instance segmentation. Our approach scales to high resolution images, and creates realistic and challenging occlusion cases.
  • Figure 3: (Top) Our location model factorizes object placement into a series of conditional likelihoods, each of which is easy to approximate or parametrize. (Bottom) We sample a desired distance to the object, $d$, and determine admissible locations for this depth (red lines are two separate examples of such placement bands).
  • Figure 4: Example bounding box proposals from our location model, separated by class.
  • Figure 5: Example of nuImages frames augmented with our approach. We show the bounding boxes for all added objects. In diverse scenarios, the location and scale of added objects are realistic and thus result in realistic augmented images.
  • ...and 13 more figures
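
The placement idea in the Figure 3 caption (sample a distance $d$, then find admissible locations at that depth) can be illustrated with a rough sketch. This is not the authors' location model. It replaces the learned conditional likelihoods with uniform sampling, and every name and parameter here (`propose_box`, `tol`, `focal_px`, `obj_height_m`, `aspect`) is a hypothetical choice made for illustration. It assumes a per-pixel depth map in metres, a boolean drivable-space mask, and a pinhole camera so that box height in pixels scales as `focal_px * obj_height_m / d`.

```python
import numpy as np


def propose_box(depth, drivable, rng, d_range=(5.0, 30.0), tol=0.5,
                focal_px=100.0, obj_height_m=1.5, aspect=2.0):
    """Toy box proposal; returns ((x0, y0, x1, y1), d) or None.

    All parameters are illustrative assumptions, not values from the paper.
    """
    # 1) Sample a target distance d (uniform here; the paper uses a learned model).
    d = rng.uniform(d_range[0], d_range[1])

    # 2) Admissible "placement band": drivable pixels whose depth is close to d.
    band = drivable & (np.abs(depth - d) <= tol)
    ys, xs = np.nonzero(band)
    if ys.size == 0:
        return None  # no drivable pixel at this distance

    # 3) Pick one band pixel as the bottom-centre anchor of the box.
    i = rng.integers(ys.size)
    y_bottom, x_center = float(ys[i]), float(xs[i])

    # 4) Scale the box with distance (pinhole assumption): nearer means larger.
    h = focal_px * obj_height_m / d
    w = aspect * h
    box = (x_center - w / 2, y_bottom - h, x_center + w / 2, y_bottom)
    return box, d


# Synthetic scene: depth shrinks toward the bottom of the image,
# and the drivable region is a rectangle in the lower half.
H, W = 100, 200
depth = np.repeat(np.linspace(60.0, 4.0, H)[:, None], W, axis=1)
drivable = np.zeros((H, W), dtype=bool)
drivable[50:, 50:150] = True

rng = np.random.default_rng(0)
proposal = propose_box(depth, drivable, rng)
```

The sketch does not clip boxes to the image or check overlap with existing objects. A usable pipeline would need both, as well as per-class size priors.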