MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
TL;DR
This work tackles the data-augmentation bottleneck in RIS by introducing MaskRIS, a masking framework that applies image and text masking simultaneously to generate diverse, semantically coherent training samples. A dual-path Distortion-aware Contextual Learning (DCL) scheme regularizes learning by aligning predictions from original and masked inputs via self-distillation, without increasing inference cost. Empirically, MaskRIS yields state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg, improves robustness to occlusion and linguistic variation, and transfers effectively to RefClef and REC tasks. Because it is architecture-agnostic, the approach offers a practical, scalable augmentation strategy for referring tasks, where conventional augmentations have tended to hurt performance.
Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text. While previous studies have focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances RIS performance. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL), to fully exploit the benefits of the masking strategy. This approach improves the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
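The masking and distillation ideas described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the released MaskRIS implementation: the helper names (`mask_image_patches`, `mask_text_tokens`, `distillation_loss`), the patch size, the mask ratios, the `[MASK]` placeholder, and the choice of a soft-target binary cross-entropy as the alignment loss are all assumptions made for illustration.

```python
import math
import random


def mask_image_patches(image, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping square patches.

    `image` is a 2D list (H x W); H and W are assumed to be multiples
    of `patch_size`. Hypothetical helper, not the authors' code.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]
    num_masked = int(round(mask_ratio * len(patches)))
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r, c in rng.sample(patches, num_masked):
        for y in range(r * patch_size, (r + 1) * patch_size):
            for x in range(c * patch_size, (c + 1) * patch_size):
                masked[y][x] = 0
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of tokens with a placeholder token."""
    num_masked = int(round(mask_ratio * len(tokens)))
    chosen = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


def distillation_loss(pred_original, pred_masked, eps=1e-7):
    """Soft-target binary cross-entropy (an assumed form of the alignment loss).

    Predictions from the original input act as fixed targets for the
    per-pixel foreground probabilities predicted from the masked input.
    """
    total = 0.0
    for target, pred in zip(pred_original, pred_masked):
        pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total / len(pred_original)


rng = random.Random(0)
image = [[1] * 8 for _ in range(8)]
tokens = "the man in the red jacket on the left".split()

# Masked copies feed the second path; the originals feed the first path.
masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)
masked_tokens = mask_text_tokens(tokens, mask_ratio=0.25, rng=rng)
```

In a training loop following this sketch, the model would be run once on `(image, tokens)` and once on `(masked_image, masked_tokens)`, with the usual segmentation loss on the first pass and `distillation_loss` pulling the second pass toward the first. Only the unmasked path is needed at inference, which is consistent with the claim that inference cost does not increase.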
