Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation

Seongsu Ha, Chaeyun Kim, Donghwa Kim, Junho Lee, Sangho Lee, Joonseok Lee

TL;DR

This work proposes a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo), which augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging.

Abstract

Referring Image Segmentation is a comprehensive task to segment an object referred by a textual query from an image. By nature, the level of difficulty in this task is affected by the existence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We posit that the bottleneck exists in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities and to concretely understand the whole expression to locate the right target better. Our approach shows consistent improvements on various datasets and models, verified by extensive experiments.

Paper Structure

This paper contains 25 sections, 21 figures, 17 tables.

Figures (21)

  • Figure 1: Diverse visual and linguistic challenges of referring scenarios. In (a), query (1) demands discernment among three road signs, while query (2) involves identifying a "woman", relatively easier due to a single instance. NeMo, our method, in (b) uses similar negative images to generate a mosaic. Query (2) becomes harder as the augmented image contains additional instances of "woman" (e.g., women standing or sitting), and thus "in front of the wall" becomes a crucial hint to solve the problem.
  • Figure 1: Statistics of representative Referring Image Segmentation (RIS) Datasets
  • Figure 2: Data samples from RIS benchmarks and augmented samples using our NeMo. RefCOCO and RefCOCO+ are characterized by relatively easier scenarios with simple referring expressions, whereas G-Refs encompass more challenging sets.
  • Figure 3: Overall NeMo pipeline. Given an image and a query, it selects negative images at a proper level of difficulty, filtering out images visually or semantically similar to the query, to avoid false negatives, and irrelevant (easy) images identified by text-to-image retrieval. It randomly selects three of the remaining images to construct a mosaic.
  • Figure 4: Comparison of negative image choices. Finding the "rightmost pizza" in (a) is nearly as easy as in the single image, as there is no other pizza-like object. Multiple road signs in (b) require discerning the relative location of a woman and an SUV, more challenging than the original single image. (c) is invalid as multiple images contain "a man jumping with a skateboard".
  • ...and 16 more figures
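
The pipeline described in the Figure 3 caption (drop candidates that are too similar or too irrelevant to the query, pick three of the rest at random, and tile them with the original image) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the query-to-image similarity scores have already been computed by an alignment model such as CLIP, and the names `select_negatives`, `build_mosaic`, `lower`, and `upper`, the fixed score thresholds, the random tile placement, and the use of small 2D lists as stand-in images are all hypothetical choices made for illustration.

```python
import random


def select_negatives(similarities, lower, upper, k=3, rng=None):
    """Pick k candidate indices whose query similarity lies in [lower, upper].

    similarities: query-to-image scores, assumed precomputed (e.g., by CLIP).
    lower, upper: hypothetical thresholds. Scores below `lower` are treated
    as too easy, scores above `upper` as likely false negatives.
    """
    if rng is None:
        rng = random.Random()
    pool = [i for i, s in enumerate(similarities) if lower <= s <= upper]
    if len(pool) < k:
        raise ValueError("not enough candidates in the difficulty band")
    return rng.sample(pool, k)


def build_mosaic(target, target_mask, negatives, rng=None):
    """Tile the target and three negatives into a 2x2 mosaic.

    Images and the mask are equal-sized 2D lists. Tile order is shuffled
    (an assumption), and negatives contribute an all-zero mask.
    Returns (mosaic_image, mosaic_mask).
    """
    if len(negatives) != 3:
        raise ValueError("a 2x2 mosaic needs exactly three negatives")
    if rng is None:
        rng = random.Random()
    h, w = len(target), len(target[0])
    empty = [[0] * w for _ in range(h)]
    tiles = [(target, target_mask)] + [(neg, empty) for neg in negatives]
    rng.shuffle(tiles)

    def stitch(a, b, c, d):
        # Concatenate rows side by side, then stack the two halves.
        top = [ra + rb for ra, rb in zip(a, b)]
        bottom = [rc + rd for rc, rd in zip(c, d)]
        return top + bottom

    image = stitch(*(t[0] for t in tiles))
    mask = stitch(*(t[1] for t in tiles))
    return image, mask


# Toy usage: six candidates with made-up similarity scores.
rng = random.Random(0)
scores = [0.95, 0.60, 0.55, 0.10, 0.50, 0.58]
picked = select_negatives(scores, lower=0.4, upper=0.8, rng=rng)

target = [[1, 1], [1, 1]]
target_mask = [[0, 1], [0, 0]]
candidates = [[[i + 2] * 2 for _ in range(2)] for i in range(len(scores))]
mosaic, mosaic_mask = build_mosaic(
    target, target_mask, [candidates[i] for i in picked], rng=rng
)
```

In this toy run, candidate 0 (score 0.95) is excluded as a likely false negative and candidate 3 (score 0.10) as too easy, so the three negatives are drawn from the remaining four. The referring expression is left unchanged, and the mosaic mask marks only the original target's pixels.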