IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis

Yuji Wang, Jingchen Ni, Yong Liu, Chun Yuan, Yansong Tang

TL;DR

IteRPrimE tackles zero-shot referring image segmentation by leveraging Grad-CAM from a vision-language pre-trained model in an iterative refinement loop and by emphasizing the primary word within referring expressions. The framework couples an Iterative Grad-CAM Refinement Strategy (IGRS) with a Primary Word Emphasis Module (PWEM) to improve localization, supported by a selective mask proposal network that chooses high-quality masks. Empirical results on RefCOCO/+/g and PhraseCut show that it outperforms previous zero-shot methods, particularly in out-of-domain settings, demonstrating robust spatial and semantic reasoning without task-specific training. The approach highlights the potential of Grad-CAM-guided RIS for accurate, zero-shot segmentation under complex linguistic guidance.

Abstract

Zero-shot Referring Image Segmentation (RIS) identifies the instance mask that best aligns with a specified referring expression without training or fine-tuning, significantly reducing the labor-intensive annotation process. Despite achieving commendable results, previous CLIP-based models have a critical drawback: the models exhibit a notable reduction in their capacity to discern relative spatial relationships of objects. This is because they generate all possible masks on an image and evaluate each masked region for similarity to the given expression, often resulting in decreased sensitivity to direct positional clues in text inputs. Moreover, most methods have weak abilities to manage relationships between primary words and their contexts, causing confusion and reduced accuracy in identifying the correct target region. To address these challenges, we propose IteRPrimE (Iterative Grad-CAM Refinement and Primary word Emphasis), which leverages a saliency heatmap through Grad-CAM from a Vision-Language Pre-trained (VLP) model for image-text matching. An iterative Grad-CAM refinement strategy is introduced to progressively enhance the model's focus on the target region and overcome positional insensitivity, creating a self-correcting effect. Additionally, we design the Primary Word Emphasis module to help the model handle complex semantic relations, enhancing its ability to attend to the intended object. Extensive experiments conducted on the RefCOCO/+/g and PhraseCut benchmarks demonstrate that IteRPrimE outperforms previous state-of-the-art zero-shot methods, particularly excelling in out-of-domain scenarios.

Paper Structure

This paper contains 14 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) The general pipeline of CLIP-based methods, which lack perception of relative spatial position because they operate on masked images. (b) The pipeline of our IteRPrimE, with the Iterative Grad-CAM Refinement Strategy and Primary Word Emphasis on "bike". (c) A comparison of positional-phrase accuracy between IteRPrimE and GL-CLIP on RefCOCO and RefCOCOg.
  • Figure 2: (a) The weak ability of the baseline model to differentiate the semantic relationships between the primary word "man" and the other noun phrases colored green and orange. PWEM makes the model aware of the target instance referred to by the primary word. (b) IGRS expands the highlighted areas beyond small, confined regions. (c) IGRS gives the model a chance to self-correct.
  • Figure 3: The Grad-CAMs and attention maps (AM) of "partially damaged car". Since the attention map (d) and Grad-CAM (e) of the primary word "car" both contain unique activation areas compared to the others, they can be harnessed from local-spatial and global-token perspectives to enhance the focus on the targeted regions, respectively.
  • Figure 4: The proposed IGRS (left) and PWEM (right). The mask $M_{t}^{'}$ is the attention mask for the cross-attention layers, obtained by setting the most salient regions of the Grad-CAM to zero. PWEM filters out meaningless tokens and augments the Grad-CAM representation from local and global aspects.
  • Figure 5: Qualitative comparisons with GL-CLIP. (a) The self-correction effect brought about by our IGRS, especially for positional phrases. (b) For unseen phrases like "not", our model shows better robustness. (c) The gathering effect of IteRPrimE, which selects the whole mask with high confidence instead of a part, as GL-CLIP does.
  • ...and 1 more figures
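
The caption of Figure 4 describes the attention mask $M_{t}^{'}$ as being formed by dropping the most salient regions of the Grad-CAM to zero. The snippet below is a minimal sketch of that single masking step, not the authors' implementation: the function name `drop_salient_regions`, the top-fraction selection rule, and the `drop_fraction` parameter are all assumptions made for illustration, and the paper may select salient regions differently (for example with a threshold).

```python
import numpy as np


def drop_salient_regions(cam, keep_mask, drop_fraction=0.2):
    """Zero out the most salient still-visible cells of a Grad-CAM map.

    cam:           2-D array of saliency scores (hypothetical Grad-CAM output).
    keep_mask:     2-D array of 0/1 values; 1 means the cell is still visible.
    drop_fraction: assumed share of the visible cells to hide in this step.
    Returns a new 0/1 mask with the same shape as keep_mask.
    """
    cam = np.asarray(cam, dtype=float)
    keep_mask = np.asarray(keep_mask, dtype=float)

    # Indices (in flattened order) of cells that earlier steps left visible.
    visible = np.flatnonzero(keep_mask.ravel() > 0)
    if visible.size == 0:
        return keep_mask.copy()

    # Hide at least one cell so every step changes the mask.
    n_drop = max(1, int(round(drop_fraction * visible.size)))

    # Rank the visible cells by saliency, highest first.
    order = np.argsort(cam.ravel()[visible])[::-1]
    dropped = visible[order[:n_drop]]

    new_mask = keep_mask.copy().ravel()
    new_mask[dropped] = 0.0
    return new_mask.reshape(keep_mask.shape)


# Usage: hide the strongest cell, then hide the next strongest.
cam = np.array([[0.1, 0.9], [0.5, 0.2]])
mask = np.ones_like(cam)
mask = drop_salient_regions(cam, mask, drop_fraction=0.25)
mask = drop_salient_regions(cam, mask, drop_fraction=0.25)
```

In an iterative scheme, such a mask would be fed back so that the next Grad-CAM is computed with the previously dominant regions suppressed, which is one way to read the "expansion of highlighted areas" mentioned in the Figure 2 caption. The sketch reuses a fixed `cam` only to stay self-contained.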