Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation

Yichen Yan, Xingjian He, Sihan Chen, Shichen Lu, Jing Liu

TL;DR

This work tackles Referring Image Segmentation (RIS), where achieving precise text-to-pixel alignment is difficult when relying on single-modality fusion. It introduces FCNet, a bi-directional vision-language guided framework that first performs vision-guided fusion to extract $N_k$ key visual channels and fuse them with language features via cross-attention, then applies language-guided calibration using a global language representation $F_t$ to produce calibrated features $F_c$ for decoding. A transformer-based decoder propagates fine-grained text information to the visual stream to generate accurate masks. Evaluations on RefCOCO, RefCOCO+, and G-Ref demonstrate state-of-the-art performance across backbones, validating improved cross-modal interaction and text-to-pixel correlation with competitive efficiency.

Abstract

Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, with the main challenge being text-to-pixel correlation. Previous methods typically rely on single-modality features, such as vision or language features, to guide the multi-modal fusion process. However, this approach limits the interaction between vision and language, leading to a lack of fine-grained correlation between the language description and pixel-level details during the decoding process. In this paper, we introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles. Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information. We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they capture the context of the input sentence. This bi-directional vision-language guided approach produces higher-quality multi-modal features for the decoder, facilitating adaptive propagation of fine-grained semantic information from textual features to visual features. Experiments on the RefCOCO, RefCOCO+, and G-Ref datasets with various backbones consistently show our approach outperforming state-of-the-art methods.

Paper Structure (15 sections, 5 equations, 4 figures, 5 tables)

This paper contains 15 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) In previous methods (e.g., VLT [ding2022vlt]), the single-guided fusion approach impedes effective interaction between vision and language, resulting in suboptimal text-pixel correlation in the decoding stage. (b) In contrast, our fusion approach has two parts: an initial vision-guided fusion process and a language-guided calibration process. We extract the key visual information with $N_k$ channels from the original vision features and fuse it with the language features in a vision-guided manner. We then use the global language information to calibrate these fused features in a language-guided manner. This bi-directional vision-language guided approach yields multi-modal features in which visual and linguistic information are deeply integrated, which proves more advantageous for text-to-pixel correlation during the decoding stage.
  • Figure 2: Our method has four main stages: text and image encoding, vision-guided fusion, language-guided calibration, and mask decoding. Its main modules are Emphasis Generation and Emphasis Calibration.
  • Figure 3: The process of our language-guided calibration. We employ the global language representation $f_{vg}$ to guide the generation of corresponding scores for each emphasis feature.
  • Figure 4: Visual comparison of FCNet and the baseline. The baseline framework utilizes the single-guided fusion approach.
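
The two-step data flow described in the captions of Figures 1 and 3 (vision-guided fusion, then language-guided calibration) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mean-pooled global language vector, the residual connection, and the softmax scoring are all hypothetical choices made only to show the direction of guidance in each step. The $N_k$ emphasis features are assumed to be already extracted from the visual backbone.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vision_guided_fusion(emphasis, words):
    """Cross-attention with visual emphasis features as queries.

    emphasis: N_k vectors of size D (assumed pre-extracted key visual features)
    words:    T word-level language vectors of size D (keys and values)
    """
    scale = math.sqrt(len(words[0]))
    fused = []
    for q in emphasis:
        weights = softmax([dot(q, w) / scale for w in words])
        attended = [sum(wt * w[d] for wt, w in zip(weights, words))
                    for d in range(len(q))]
        # Residual connection (an assumption, common in attention blocks).
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def language_guided_calibration(fused, global_lang):
    """Score each fused feature against a global language vector and rescale it.

    The softmax scoring rule is an assumption; the source only states that
    the global language representation guides per-feature scores.
    """
    scale = math.sqrt(len(global_lang))
    scores = softmax([dot(f, global_lang) / scale for f in fused])
    calibrated = [[s * x for x in f] for s, f in zip(scores, fused)]
    return calibrated, scores


random.seed(0)
D, N_K, T = 4, 3, 5  # toy sizes, chosen only for illustration
emphasis = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N_K)]
words = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
# Global language representation: mean over words (hypothetical pooling).
global_lang = [sum(w[d] for w in words) / T for d in range(D)]

fused = vision_guided_fusion(emphasis, words)
calibrated, scores = language_guided_calibration(fused, global_lang)
print(len(calibrated), len(calibrated[0]))
```

In this sketch the visual features act as queries in the first step, so vision decides which words matter, while in the second step the sentence-level vector decides how much each fused feature contributes. That pairing is the sense in which the guidance is bi-directional.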