SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding

Morgan Heisler, Amin Banitalebi-Dehkordi, Yong Zhang

TL;DR

SemAug addresses the brittleness of traditional image augmentations by injecting contextually meaningful content into scenes via language grounding. It builds an object bank from dataset-derived masks, matches what and where to paste through word embeddings, and pastes selected instances without training extra context networks, enabling new categories and richer scene semantics. Across COCO and Pascal VOC, SemAug yields consistent improvements in object detection and segmentation across multiple architectures, with negligible computational overhead and enhanced data efficiency. This approach tightens the link between visual context and semantic knowledge, improving generalization while remaining practical for real-world deployment.

Abstract

Data augmentation is an essential technique in improving the generalization of deep neural networks. The majority of existing image-domain augmentations either rely on geometric and structural transformations, or apply different kinds of photometric distortions. In this paper, we propose an effective technique for image augmentation by injecting contextually meaningful knowledge into the scenes. Our method of semantically meaningful image augmentation for object detection via language grounding, SemAug, starts by calculating semantically appropriate new objects that can be placed into relevant locations in the image (the what and where problems). Then it embeds these objects into their relevant target locations, thereby promoting diversity of object instance distribution. Our method allows for introducing new object instances and categories that may not even exist in the training set. Furthermore, it does not require the additional overhead of training a context network, so it can be easily added to existing architectures. Our comprehensive set of evaluations showed that the proposed method is very effective in improving the generalization, while the overhead is negligible. In particular, for a wide range of model architectures, our method achieved ~2-4% and ~1-2% mAP improvements for the task of object detection on the Pascal VOC and COCO datasets, respectively.
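
The embedding step mentioned in the abstract (placing a chosen object at its target location) can be pictured as a mask-based paste. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `paste_object`, the (top, left) placement convention, and the hard binary mask without blending are all hypothetical choices made for illustration.

```python
import numpy as np


def paste_object(image, obj, mask, top, left):
    """Return a copy of `image` with `obj` pasted where `mask` is True.

    image: (H, W, 3) array, the target scene.
    obj:   (h, w, 3) array, the cropped object from an object bank.
    mask:  (h, w) boolean array, True on object pixels.
    top, left: hypothetical placement of the crop's upper-left corner.
    """
    h, w = mask.shape
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("object does not fit inside the image at this location")

    out = image.copy()
    # `region` is a view into `out`, so writing to it updates `out`.
    region = out[top:top + h, left:left + w]
    # Copy only the masked (object) pixels; background pixels stay untouched.
    region[mask] = obj[mask]
    return out
```

In a detection setting the annotation list would also need a new bounding box and label for the pasted instance; that bookkeeping is omitted here.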

Paper Structure

This paper contains 35 sections, 8 equations, 29 figures, 16 tables, 1 algorithm.

Figures (29)

  • Figure 1: Examples of our method: originals (left) and semantically augmented (right).
  • Figure 2: Various methods of augmentation. From left to right: the original image, traditional augmentations (flip, contrast/brightness adjustment, additive noise), random object placement, and SemAug (our method). A giraffe could reasonably be found in a field with elephants, whereas a traffic light has no contextual basis in this scene.
  • Figure 3: Illustration of our data augmentation approach. After an image is selected for semantic augmentation, its semantic labels are converted into word vectors. The similarity between these word vectors and the word vectors of the objects available for pasting is then computed. One of these objects is chosen from the object bank based on a criterion such as balancing the number of objects in a dataset or adding more instances of a poorly performing object category. The chosen object is then pasted into the image in the vicinity of the most similar label.
  • Figure 4: Our method can augment different instances from the same object category. Top row: Different instances from the category airplane are inserted. Bottom row: Different instances from the category kite are inserted.
  • Figure 5: Our method can augment instances of different categories. Top row: Instances from the categories truck, bus, and motorcycle are inserted. Bottom row: Instances from the categories sheep and bird are inserted. Note that objects are inserted in logical locations: vehicles on roads, birds in trees, sheep on grass.
  • ...and 24 more figures
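
The matching step described in the Figure 3 caption (compare word vectors of scene labels with those of bank objects, then paste near the most similar label) can be sketched as follows. This is an illustrative sketch only: the three-dimensional vectors in `EMBEDDINGS` are made-up toy values standing in for pretrained word embeddings, and the function name `choose_what_and_where`, the `threshold` parameter, and the "pick the highest similarity" rule are assumptions. The caption mentions other selection criteria, such as class balancing, which are not modeled here.

```python
import numpy as np

# Toy word vectors (hypothetical values, NOT real pretrained embeddings).
EMBEDDINGS = {
    "elephant": np.array([0.9, 0.1, 0.0]),
    "giraffe": np.array([0.8, 0.2, 0.1]),
    "traffic light": np.array([0.0, 0.1, 0.9]),
    "grass": np.array([0.6, 0.5, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def choose_what_and_where(scene_labels, bank_labels, threshold=0.8):
    """Pick the bank object ("what") and the scene label to paste near ("where").

    Returns (bank_label, anchor_label, similarity) for the most similar pair
    whose similarity is at least `threshold`, or None if no pair qualifies.
    """
    best = None
    for candidate in bank_labels:
        for anchor in scene_labels:
            sim = cosine(EMBEDDINGS[candidate], EMBEDDINGS[anchor])
            if sim >= threshold and (best is None or sim > best[2]):
                best = (candidate, anchor, sim)
    return best


# Scene from the Figure 2 example: elephants in a field.
result = choose_what_and_where(["elephant", "grass"], ["giraffe", "traffic light"])
print(result)
```

With these toy vectors the giraffe is matched to the elephant label and the traffic light is rejected, mirroring the Figure 2 example; with real embeddings the outcome depends on the embedding model and the threshold chosen.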