EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu

TL;DR

EAVL explicitly aligns vision and language features in the segmentation stage using dynamic convolution kernels based on the input image and sentence, and it surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.

Abstract

Referring image segmentation (RIS) aims to segment an object mentioned in natural language from an image. The main challenge is fine-grained text-to-pixel correlation. In previous methods, the final results are obtained by convolutions with a fixed kernel, following a pattern similar to traditional image segmentation. These methods lack explicit alignment of language and vision features in the segmentation stage, resulting in suboptimal correlation. In this paper, we introduce EAVL, a method that explicitly aligns vision and language features. In contrast to fixed convolution kernels, we introduce a Vision-Language Aligner that aligns features in the segmentation stage using dynamic convolution kernels based on the input image and sentence. Specifically, we generate multiple queries representing different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels, which are applied in the segmentation stage to produce a series of masks. The final result is obtained by aggregating all masks. Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation. We surpass previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins. Additionally, our method is designed to be a generic plug-and-play module for cross-modality alignment in the RIS task, making it easy to integrate with other RIS models for substantial performance improvements.
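The query-to-kernel-to-mask pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each query is mapped to a 1x1 convolution kernel by a single linear projection (`w_kernel`, hypothetical), that each query receives a scalar score from a linear head (`w_score`, hypothetical), and that "aggregating all masks" means a softmax-weighted sum. The abstract does not specify the kernel size, the projection, or the aggregation rule, and the function name `dynamic_kernel_masks` is also made up for illustration.

```python
import numpy as np


def dynamic_kernel_masks(features, queries, w_kernel, w_score):
    """Sketch of query-based dynamic convolution (assumptions, not EAVL itself).

    features: (C, H, W) fused vision-language feature map
    queries:  (N, D)    N queries, each a different emphasis of the sentence
    w_kernel: (D, C)    hypothetical projection from a query to a 1x1 kernel
    w_score:  (D,)      hypothetical head giving one importance logit per query
    """
    # One kernel per query, so the kernels depend on the input image and sentence.
    kernels = queries @ w_kernel                          # (N, C)

    # A 1x1 convolution is a per-pixel dot product between kernel and features.
    masks = np.einsum("nc,chw->nhw", kernels, features)   # (N, H, W)

    # Assumed aggregation: softmax over per-query logits, then a weighted sum.
    logits = queries @ w_score                            # (N,)
    exp = np.exp(logits - logits.max())
    scores = exp / exp.sum()                              # (N,), sums to 1
    final = np.einsum("n,nhw->hw", scores, masks)         # (H, W)
    return masks, scores, final


# Toy usage with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 5))   # C=8, H=4, W=5
queries = rng.standard_normal((3, 6))       # N=3 queries of dimension D=6
w_kernel = rng.standard_normal((6, 8))
w_score = rng.standard_normal(6)

masks, scores, final = dynamic_kernel_masks(features, queries, w_kernel, w_score)
print(masks.shape, scores.shape, final.shape)
```

A fixed-kernel baseline would replace `kernels` with a single learned `(C,)` vector that is identical for every image and sentence. In this sketch, the only thing that makes the kernels dynamic is that they are computed from the queries.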

Paper Structure

This paper contains 16 sections, 14 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: (a) In previous methods (e.g., VLT [ding2021vision]), the vision-language features obtained after fusion are fed directly into a Transformer Decoder, and the final result is obtained using a fixed convolution kernel, much as in the segmentation stage of traditional image segmentation. (b) Our approach instead generates specialized vision-language features, known as queries, and transforms them into a series of dynamic query-based convolution kernels. This not only makes fuller use of the vision-language features but also explicitly aligns the vision features with language features to achieve fine-grained text-to-pixel correlation.
  • Figure 2: EAVL mainly consists of a text encoder, an image encoder, a Multi-Query Generator, a Transformer Decoder, and a Vision-Language Aligner. The Vision-Language Aligner has two parts, a Multi-Mask Generator and a Multi-Query Estimator.
  • Figure 3: Details of the query generation process.
  • Figure 4: Examples of masks focusing on different regions, together with their corresponding scores obtained from the Multi-Query Estimator.
  • Figure 5: Visualization of our method (using the Vision-Language Aligner) and the traditional method, which uses convolutions with a fixed kernel in the segmentation stage.