MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim

TL;DR

This work tackles the RIS data-augmentation bottleneck by introducing MaskRIS, a holistic masking framework that applies simultaneous image and text masking to generate diverse, semantically coherent training samples. A dual-path Distortion-aware Contextual Learning (DCL) scheme regularizes learning by aligning predictions from original and masked inputs via self-distillation, without increasing inference cost. Empirically, MaskRIS yields state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg, improves robustness to occlusion and linguistic variation, and transfers effectively to RefClef and REC tasks, all with architecture-agnostic applicability. The approach offers a practical, scalable augmentation strategy for referring tasks that previously struggled with conventional augmentations.

Abstract

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances RIS performance. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
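
To make the masking idea in the abstract concrete, here is a minimal sketch of random image-patch masking and random text-token masking. It is an illustration under stated assumptions, not the authors' implementation: the function names, the patch size, the mask ratios, the zero fill value, and the "[MASK]" placeholder are all hypothetical choices made for this example.

```python
import random


def mask_image_patches(image, patch_size=2, mask_ratio=0.5, fill=0, rng=None):
    """Return a copy of `image` (a list of pixel rows) with random patches set to `fill`.

    patch_size, mask_ratio, and fill are illustrative assumptions.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Top-left corner of every non-overlapping patch in the grid.
    patches = [(top, left)
               for top in range(0, height, patch_size)
               for left in range(0, width, patch_size)]
    num_masked = int(len(patches) * mask_ratio)
    masked = [row[:] for row in image]  # leave the input untouched
    for top, left in rng.sample(patches, num_masked):
        for y in range(top, min(top + patch_size, height)):
            for x in range(left, min(left + patch_size, width)):
                masked[y][x] = fill
    return masked


def mask_text_tokens(tokens, mask_ratio=0.25, mask_token="[MASK]", rng=None):
    """Return a copy of `tokens` with a random subset replaced by `mask_token`.

    mask_ratio and mask_token are illustrative assumptions.
    """
    rng = rng or random.Random()
    num_masked = int(len(tokens) * mask_ratio)
    chosen = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1] * 4 for _ in range(4)]
    query = "lady under the red umbrella on the left".split()
    print(mask_image_patches(image, rng=rng))
    print(mask_text_tokens(query, rng=rng))
```

Unlike cropping, flipping, or color distortion, neither function alters positions or colors in the parts of the input that remain visible, so spatial and color words in the query (such as "left" or "red") still describe the unmasked content.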

Paper Structure

This paper contains 22 sections, 5 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Conventional data augmentations (DA) in semantic segmentation are incompatible with referring image segmentation. Random crop and horizontal flip could change the referred object (e.g., "lady under the red umbrella on left") to another one, and color distortion could make the described object disappear.
  • Figure 2: Comparison with CM-MaskSD [wang2024cm]. (a) CM-MaskSD employs global feature guided masking via a vision-language model (VLM), such as CLIP, which requires a pre-aligned vision-language representation. It focuses on high-correlation regions but lacks diversity and architecture-agnostic properties. (b) MaskRIS adopts holistic context masking, yielding richer diversity and improved architecture independence.
  • Figure 3: Existing RIS methods show a noticeable decline in their performance when applying conventional image augmentations (random cropping, color jittering, and horizontal flipping). In contrast, image masking (I-Mask) and text masking (T-Mask) improve model performance.
  • Figure 4: The existing RIS method tends to be inaccurate when faced with occluded context. CARIS represents the SoTA method in RIS. Words highlighted in red represent occluded objects in the image (left and center) and masked words in the text query (right).
  • Figure 5: The overall framework of MaskRIS. Both image and text masking are employed to generate diverse image-text training pairs (Sec. \ref{sec:3.2}). To maximize the benefits of the masking strategy, Distortion-aware Contextual Learning (DCL) is introduced (Sec. \ref{meethod:ours}).
  • ...and 7 more figures
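
The dual-path training described for Figure 5 can be sketched as a two-term loss: a segmentation loss on the prediction from the original input, plus a self-distillation term that pulls the prediction from the masked input toward the original-path prediction. The sketch below assumes per-pixel binary cross-entropy for both terms and a hypothetical weight `distill_weight`; the actual loss functions and weighting used by MaskRIS are not stated in this summary, so treat every name here as illustrative.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy; `targets` may be soft (values in [0, 1])."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def dcl_loss(probs_original, probs_masked, ground_truth, distill_weight=1.0):
    """Hypothetical dual-path objective (an assumption, not the paper's exact loss).

    probs_original: foreground probabilities predicted from the unmasked pair.
    probs_masked:   foreground probabilities predicted from the masked pair.
    ground_truth:   binary mask labels (0 or 1) for the same pixels.
    """
    # Path 1: the original input is supervised by the ground-truth mask.
    seg_loss = binary_cross_entropy(probs_original, ground_truth)
    # Path 2: the masked input is aligned with the original-path prediction.
    # In an autodiff framework the teacher would be detached (no gradient).
    teacher = list(probs_original)
    distill_loss = binary_cross_entropy(probs_masked, teacher)
    return seg_loss + distill_weight * distill_loss
```

Because both paths share one model and the masked path exists only during training, a scheme of this shape adds no cost at inference time, which matches the claim in the TL;DR.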