Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation

Ting Liu, Siyuan Li

TL;DR

The paper tackles zero-shot referring image segmentation by addressing mask feature quality and spatial alignment. It introduces a training-free hybrid global-local feature extraction framework that fuses region-specific and context-aware information from CLIP, paired with a spatial guidance augmentation that leverages spatial relationships, coherence, and positional cues. Through extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut, the method achieves substantial improvements over state-of-the-art zero-shot RIS baselines, demonstrating strong cross-dataset generalization and effectiveness without additional training. This work provides a versatile framework for region-text alignment, enhancing cross-modal understanding with practical implications for visual grounding and human-computer interaction.

Abstract

Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL.
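
To make the hybrid global-local idea more concrete, the sketch below scores each mask proposal against a referring expression by blending a mask-specific (local) embedding with a context-aware (global) embedding and taking the cosine similarity with the text embedding. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the linear blend, and the mixing weight `alpha` are hypothetical, and the toy vectors stand in for CLIP embeddings of SAM mask proposals.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def hybrid_feature(local_feat, global_feat, alpha=0.5):
    """Blend a local (mask-specific) and a global (context-aware) embedding.

    `alpha` is a hypothetical mixing weight chosen for illustration;
    it is not a value taken from the paper.
    """
    local_feat = l2_normalize(local_feat)
    global_feat = l2_normalize(global_feat)
    mixed = [alpha * l + (1 - alpha) * g for l, g in zip(local_feat, global_feat)]
    return l2_normalize(mixed)


def select_mask(local_feats, global_feats, text_feat, alpha=0.5):
    """Return the index of the best-matching mask proposal and all scores."""
    scores = [
        cosine(hybrid_feature(l, g, alpha), text_feat)
        for l, g in zip(local_feats, global_feats)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy stand-ins for per-mask image embeddings and one text embedding.
local_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
global_feats = [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]]
text_feat = [0.0, 1.0, 0.0]

best, scores = select_mask(local_feats, global_feats, text_feat)
print(best, scores)
```

In this toy setup the second proposal is selected because both its local and global embeddings point toward the text embedding. In a real pipeline the embeddings would come from an image-text model such as CLIP, and no training step is involved, which matches the training-free setting described above.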

Paper Structure

This paper contains 14 sections, 9 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Common issues in existing methods: 1) Inaccurate mask feature extraction; 2) Incorrect spatial localization; 3) Incomplete segmentation.
  • Figure 2: The proposed framework combines hybrid global-local feature extraction with multiple spatial guidance mechanisms to improve zero-shot referring image segmentation, using mask proposals generated by SAM. By leveraging both broad context and local details, and by adding spatial guidance, the framework improves segmentation of the target described by the text.
  • Figure 3: Visual comparisons with existing methods. Our approach achieves more accurate localization and a complete segmentation of the target object.
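
As a rough illustration of how a positional cue could correct the "incorrect spatial localization" issue listed for Figure 1, the sketch below re-ranks mask proposals using the location of each mask's centroid when the expression contains a direction word. This is an assumption-laden toy, not the paper's spatial guidance formulation: `position_prior`, `rerank`, the additive combination, and the `weight` value are all hypothetical.

```python
def mask_centroid(mask):
    """Return the (row, col) centroid of a binary mask given as a list of rows."""
    total = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                total += 1
                row_sum += r
                col_sum += c
    if total == 0:
        raise ValueError("mask is empty")
    return row_sum / total, col_sum / total


def position_prior(mask, direction):
    """Score in [0, 1], higher when the centroid lies toward `direction`."""
    height, width = len(mask), len(mask[0])
    row, col = mask_centroid(mask)
    x = col / (width - 1) if width > 1 else 0.5   # 0 = far left, 1 = far right
    y = row / (height - 1) if height > 1 else 0.5  # 0 = top, 1 = bottom
    return {"left": 1 - x, "right": x, "top": 1 - y, "bottom": y}[direction]


def rerank(similarities, masks, direction, weight=0.3):
    """Add a weighted positional prior to each similarity score.

    `weight` is a hypothetical trade-off value chosen for illustration.
    """
    return [s + weight * position_prior(m, direction)
            for s, m in zip(similarities, masks)]


# Two toy 2x4 masks: one on the left half, one on the right half.
left_mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
right_mask = [[0, 0, 1, 1], [0, 0, 1, 1]]

# Suppose text similarity alone slightly prefers the right mask.
similarities = [0.50, 0.55]
reranked = rerank(similarities, [left_mask, right_mask], "left")
print(reranked)
```

Here the text similarity alone would pick the right-hand mask, while the positional prior for "left" moves the left-hand mask to the top, which is the kind of correction a spatial cue is meant to provide.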