Extending CLIP's Image-Text Alignment to Referring Image Segmentation

Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak

TL;DR

RISCLIP tackles referring image segmentation by reusing the cross-modal alignment inherent in CLIP. It freezes CLIP and augments it with Cross-modal Feature Extraction and Shared-space Knowledge Exploitation to turn patch-level groundings into precise pixel-level segmentations via a lightweight decoder. The two-stage training regime and targeted adapters preserve CLIP’s general knowledge while enabling dense prediction, yielding state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg benchmarks and strong gains over prior CLIP-based RIS methods. The approach demonstrates the practical value of leveraging cross-modal backbone alignment for RIS and offers a pathway to integrating CLIP-like models into dense, text-driven vision tasks.

Abstract

Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.

Paper Structure (22 sections, 2 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: CLIP's image-text alignment produces preliminary patch-level groundings through cosine similarity between patch-level image and sentence-level text features. Building upon this alignment, we refine CLIP's groundings into accurate segmentations with three modules. Cross-modal Feature Extraction (CFE) modules enhance CLIP's unimodal image and text features by aligning them at candidate regions. Shared-space Knowledge Exploitation (SKE) modules leverage the rich alignment knowledge in CLIP's image-text shared-embedding space to discern the target referent. Lastly, a decoder transforms the patch-level grounding into a pixel-wise segmentation.
  • Figure 2: The overall pipeline of RISCLIP. We adopt frozen CLIP image and text encoders as backbones to exploit their aligned image and text features and adapt them to RIS with two modules, CFE and SKE. First, the CFE modules between the encoders enable cross-modal communication, aligning the unimodal features at candidate regions. Second, the SKE modules on top of the encoders leverage the rich cross-modal alignment knowledge in CLIP's image-text shared-embedding space to discern the target referent. Cosine similarity between the patch- and sentence-level features then produces a patch-level grounding map. Lastly, a decoder refines the map into a pixel-level segmentation prediction.
  • Figure 3: Visualization of RISCLIP-B predictions on the RefCOCOg-UMD (Nagaraja et al., 2016) test set. Row a) shows RISCLIP's understanding of various instances; row b) shows its detection of partial or blurry instances and its differentiation of similar objects; row c) shows its discernment of the target instance among similar instances described by lengthy texts.
  • Figure A1: Visualization of RISCLIP-B predictions on RefCOCOg-UMD (Nagaraja et al., 2016) test set samples. RISCLIP fails to recognize alphabetic and numeric characters.
  • Figure A2: Visualization of RISCLIP-B predictions on RefCOCOg-UMD (Nagaraja et al., 2016) test set samples. RISCLIP fails to comprehend texts that describe the target object by the 'absence' of some attribute.
  • ...and 2 more figures
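
The captions for Figures 1 and 2 describe a patch-level grounding map obtained from the cosine similarity between patch-level image features and a sentence-level text feature. The sketch below illustrates only that scoring step in plain Python. The function names (`cosine_similarity`, `patch_grounding_map`), the 2x2 grid, and the toy feature values are assumptions made for illustration; they are not taken from the paper, and the CFE modules, SKE modules, and decoder are not modeled here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)


def patch_grounding_map(patch_features, sentence_feature, grid_h, grid_w):
    """Score every patch against the sentence feature and reshape to a grid."""
    if len(patch_features) != grid_h * grid_w:
        raise ValueError("expected grid_h * grid_w patch features")
    scores = [cosine_similarity(p, sentence_feature) for p in patch_features]
    # Patches are assumed to be listed in row-major order.
    return [scores[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy 2x2 patch grid with 3-dimensional features (hypothetical values).
patches = [
    [1.0, 0.0, 0.0],   # points the same way as the sentence feature
    [0.0, 1.0, 0.0],   # orthogonal to it
    [1.0, 1.0, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # points the opposite way
]
sentence = [1.0, 0.0, 0.0]

grounding = patch_grounding_map(patches, sentence, grid_h=2, grid_w=2)
for row in grounding:
    print([round(score, 3) for score in row])
```

Higher scores mark patches whose features point in the same direction as the sentence feature. In the pipeline described above, a map of this kind is only a coarse starting point that the decoder then refines into a pixel-level segmentation.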