Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Ting Liu, Siyuan Li
TL;DR
The paper tackles zero-shot referring image segmentation by addressing mask feature quality and spatial alignment. It introduces a training-free hybrid global-local feature extraction framework that fuses region-specific and context-aware information from CLIP, paired with a spatial guidance augmentation that leverages spatial relationships, coherence, and positional cues. Through extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut, the method achieves substantial improvements over state-of-the-art zero-shot RIS baselines, demonstrating strong cross-dataset generalization and effectiveness without additional training. This work provides a versatile framework for region-text alignment, enhancing cross-modal understanding with practical implications for visual grounding and human-computer interaction.
Abstract
Recent work on zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, has made substantial progress in aligning visual and textual information. Despite these successes, extracting precise, high-quality representations of mask regions remains a critical challenge that limits what zero-shot RIS can achieve. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen the alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing the described areas. By incorporating multiple spatial cues, this approach makes referring segmentation more robust and precise. Extensive experiments on standard RIS benchmarks demonstrate that our method substantially outperforms existing zero-shot RIS models. We believe our approach advances RIS and establishes a versatile framework for region-text alignment, with broader implications for cross-modal understanding and interaction. Code is available at https://github.com/fhgyuanshen/HybridGL.
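
To make the idea of hybrid global-local matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate mask already comes with a local (mask-specific) feature vector and a global (context-aware) feature vector, and that the referring expression has been encoded as a text feature vector, for example by CLIP. The function names, the linear fusion with weight `alpha`, and the additive spatial prior weighted by `beta` are all hypothetical choices made for illustration; the abstract does not specify how the features are fused or how the spatial cues are applied.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def hybrid_feature(local_feat, global_feat, alpha=0.5):
    """Fuse local and global features by a weighted sum (assumed fusion rule)."""
    local_feat = l2_normalize(local_feat)
    global_feat = l2_normalize(global_feat)
    mixed = [alpha * l + (1.0 - alpha) * g for l, g in zip(local_feat, global_feat)]
    return l2_normalize(mixed)


def select_mask(text_feat, local_feats, global_feats,
                spatial_prior=None, alpha=0.5, beta=0.0):
    """Score every candidate mask against the text and return the best index.

    spatial_prior is an optional per-mask score (hypothetical) standing in for
    spatial guidance; it is added to the similarity with weight beta.
    """
    scores = []
    for i, (lf, gf) in enumerate(zip(local_feats, global_feats)):
        score = cosine(hybrid_feature(lf, gf, alpha), text_feat)
        if spatial_prior is not None:
            score += beta * spatial_prior[i]
        scores.append(score)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy usage with 2-D features for two candidate masks.
text = [1.0, 0.0]
local_feats = [[1.0, 0.0], [0.0, 1.0]]
global_feats = [[1.0, 0.0], [0.0, 1.0]]
best_index, mask_scores = select_mask(text, local_feats, global_feats)
```

In this toy setup the first mask is selected because its fused feature points in the same direction as the text feature. Passing a `spatial_prior` with a nonzero `beta` shows how a positional cue could override a purely appearance-based match.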
