Towards Motion-aware Referring Image Segmentation

Chaeyun Kim; Seunghoon Yi; Yejin Kim; Yohan Jo; Joonseok Lee

Abstract

Referring Image Segmentation (RIS) requires identifying objects in images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from the original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries and a new benchmark, M-Bench, in which objects are distinguished primarily by actions. Extensive experiments show that our method substantially improves performance on motion-centric queries across multiple RIS models while maintaining competitive results on appearance-based descriptions. Code is available at https://github.com/snuviplab/MRaCL

Paper Structure (17 sections, 4 equations, 11 figures, 8 tables)

Introduction
Related Work
Method
Data Augmentation with Motion-centric Verb Phrase Retrieval
Multi-modal Contrastive Learning
Multimodal Radial Contrastive Learning
Experiments
Experimental Setups
Results and Analysis
Ablation Study
Effects of Hyperparameters
Conclusion
M-Ref Motion & Static Split Examples
MBench Dataset Details
Details on MBench Construction
...and 2 more sections

Figures (11)

Figure 1: Appearance-based (left) vs. motion-centric (right) queries in RIS. Existing methods handle the former well but struggle with the latter.
Figure 2: Overview of the MRaCL framework. (a) We apply the CE loss to the output masks of the original pairs, as in the baselines. The Fuser mixes the text embeddings with the cross-modal embeddings from the decoder and returns multimodal representations $\textbf{z}^{(i)}$. (b) With these multimodal latents, we calculate similarity scores and filter out false negatives. For example, with $\textbf{z}^{(1)}$ as the anchor, $\textbf{z}^{(2)}$ is considered a false negative sample and is masked out, as shown in the bottom-right diagram. Components marked with '$*$' originate from the baseline model.
Figure 3: Anisotropy phenomenon observed in a baseline model (CRIS). We plot the distribution of pairwise angular distances over 100K pairs (a) from the original CRIS and (b) with our MRaCL loss. With our MRaCL loss, the model uses a significantly wider region of the representation space.
Figure 4: Overview of the MBench dataset annotation pipeline.
Figure 5: Visualization of activation maps with and without MRaCL on CRIS.
...and 6 more figures
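
The captions of Figures 2 and 3 mention two operations that a short sketch may make more concrete: measuring the angular distance between embeddings, and computing a contrastive loss in which false negatives are masked out of the denominator. The code below is a minimal, generic illustration under stated assumptions, not the MRaCL loss itself. It uses a standard InfoNCE form with cosine similarity, and the names `masked_info_nce`, `positive`, `false_negative`, and `temperature` are hypothetical and do not come from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_distance(u, v):
    """Angle in radians between u and v (the quantity plotted in Figure 3).

    The cosine is clamped to [-1, 1] to guard against rounding error.
    """
    return math.acos(max(-1.0, min(1.0, cosine(u, v))))


def masked_info_nce(z, positive, false_negative, temperature=0.1):
    """Generic InfoNCE over fused embeddings z (assumed form, not MRaCL).

    positive[i]       -- index of the positive sample for anchor i
    false_negative[i] -- set of indices dropped from anchor i's denominator
    """
    losses = []
    for i, anchor in enumerate(z):
        logits = {}
        for j, other in enumerate(z):
            if j == i or j in false_negative[i]:
                continue  # skip the anchor itself and masked false negatives
            logits[j] = cosine(anchor, other) / temperature
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        losses.append(log_denom - logits[positive[i]])
    return sum(losses) / len(losses)


# Toy fused embeddings: z[0] and z[2] stand for the same object described
# differently, so each is treated as a false negative for the other.
z = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
positive = [1, 0, 3, 2]
masked = masked_info_nce(z, positive, [{2}, set(), {0}, set()])
unmasked = masked_info_nce(z, positive, [set(), set(), set(), set()])
```

In this toy setup, masking removes a positive term from two anchors' denominators, so `masked` is lower than `unmasked`. This shows why leaving false negatives in would penalize embeddings that correctly place two descriptions of the same object close together.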
