Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation

Yubin Cho, Hyunwoo Yu, Suk-ju Kang

TL;DR

CrossVLT introduces bidirectional cross-aware early fusion between stage-divided vision and language Transformer encoders for referring image segmentation. It enhances cross-modal context modeling by exchanging information at every encoder stage and aligns intermediate features through a text-to-pixel contrastive loss applied before fusion. Empirical results on RefCOCO, RefCOCO+, and G-Ref show consistent improvements over state-of-the-art methods, with ablations validating the necessity of both fusion and alignment across all stages. The approach yields more precise segmentation and greater robustness to linguistic variation, highlighting the benefits of intermediate-feature alignment in multi-modal fusion tasks.

Abstract

Referring segmentation aims to segment the target object described by a natural language expression. The key challenges are understanding complex and ambiguous expressions and, guided by the expression, locating the relevant regions in an image that contains multiple objects. Recent models perform early fusion by injecting language features at intermediate stages of the vision encoder, but in these approaches the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which lets both the language and vision encoders perform early fusion to improve cross-modal context modeling. Unlike previous methods, our method enables the vision and language features to refer to each other's information at each stage, mutually enhancing the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on high-level features for cross-modal alignment, we introduce a feature-based alignment scheme in which low-level to high-level features of the vision and language encoders all take part in the alignment. Aligning the intermediate cross-modal features at every encoder stage leads to effective cross-modal fusion. The proposed approach is simple but effective for referring image segmentation, and it outperforms previous state-of-the-art methods on three public benchmarks.
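The bidirectional fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_attention`, `cross_aware_fusion`), the single-head attention without learned projections, and the residual connections are all assumptions made for illustration. It only shows the core idea that, within one stage, vision tokens attend to language tokens and language tokens attend to vision tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head scaled dot-product attention (hypothetical helper).

    query:   (Nq, d) tokens that gather information
    context: (Nc, d) tokens that are attended to
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)      # (Nq, Nc)
    return softmax(scores, axis=-1) @ context    # (Nq, d)

def cross_aware_fusion(vision, language):
    """One fusion step in both directions (assumed residual form).

    Both directions read the pre-fusion features of the other modality.
    """
    vision_out = vision + cross_attention(vision, language)
    language_out = language + cross_attention(language, vision)
    return vision_out, language_out

rng = np.random.default_rng(0)
vision = rng.standard_normal((16, 8))    # 16 flattened pixel tokens, dim 8
language = rng.standard_normal((5, 8))   # 5 word tokens, dim 8
v_out, l_out = cross_aware_fusion(vision, language)
print(v_out.shape, l_out.shape)
```

In a unidirectional early-fusion design only the first of the two updates would be applied; the second update is what allows the language features to depend on the image.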

Paper Structure (19 sections, 9 equations, 17 figures, 7 tables)

This paper contains 19 sections, 9 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Architectures of various fusion approaches for referring image segmentation. (a) Late fusion approach (e.g., VLT [ding2021vision], CRIS [wang2022cris]) that fuses in the transformer decoder after the encoder feature extraction. (b) Previous early fusion approaches (e.g., LAVT [yang2022lavt], PLV [liao2022progressive]) that unidirectionally refer to the language features in the vision encoder. (c) Our CrossVLT that bidirectionally performs cross-aware early fusion at each stage to interconnect both encoders for mutual enhancement.
  • Figure 2: Overview of CrossVLT, consisting of the stage-divided vision and language encoders, the feature-based alignment, and the segmentation decoder. At each stage, the vision and language encoders consider each other’s features through cross-aware fusion to capture the rich contextual information in each encoder. The feature-based alignment is used to better embed the vision and language features into the same space by applying the contrastive learning to the intermediate stages of each encoder.
  • Figure 3: (a) The cross-aware fusion block fuses the cross-modal information bidirectionally. (b) The vision query fusion layer consists of two cross attentions with downsampling to consider language-aware multi-scale vision features. (c) The fusion layer using language features as a query.
  • Figure 4: The structure of the conventional alignment and our feature-based alignment. (a) The final features are solely responsible for aligning the vision and language features. (b) The low-level to high-level features engage in the alignment for a more comprehensive alignment of the intermediate cross-modal features.
  • Figure 5: The scheme of the alignment loss using vision feature tokens and a language [CLS] token to embed the cross-modal features into the same space.
  • ...and 12 more figures
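
The alignment loss in the Figure 5 caption pairs the vision feature tokens with a single language [CLS] token. A minimal sketch of one common text-to-pixel contrastive form is given below, assuming a sigmoid over the cosine similarity between each pixel token and the [CLS] token, supervised by the ground-truth mask. The exact formulation, the `temperature` parameter, and the function names are assumptions and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so the dot product is a cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def text_to_pixel_loss(pixel_feats, cls_feat, mask, temperature=1.0):
    """Binary cross-entropy between pixel/[CLS] similarity and the mask.

    pixel_feats: (N, d) vision tokens of one stage (hypothetical layout)
    cls_feat:    (d,)   language [CLS] token of the same stage
    mask:        (N,)   1 for pixels inside the referred object, else 0
    """
    logits = l2_normalize(pixel_feats) @ l2_normalize(cls_feat) / temperature
    # Numerically stable form of BCE-with-logits.
    loss = np.maximum(logits, 0) - logits * mask + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# Toy example: one pixel matches the text direction, one opposes it.
cls_feat = np.array([1.0, 0.0])
pixel_feats = np.array([[1.0, 0.0], [-1.0, 0.0]])
aligned = text_to_pixel_loss(pixel_feats, cls_feat, np.array([1.0, 0.0]))
misaligned = text_to_pixel_loss(pixel_feats, cls_feat, np.array([0.0, 1.0]))
print(aligned < misaligned)
```

Under the feature-based alignment described in the Figure 4 caption, a loss of this kind would be computed on the features of each encoder stage rather than only on the final ones; how the per-stage terms are weighted and combined is not specified here.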