Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

TL;DR

The paper tackles the conflicting predictions and limited multimodal understanding that hinder multi-task visual grounding (REC and RIS). It introduces $\text{C}^3\text{VG}$, a two-stage coarse-to-fine framework with Rough Semantic Perception (RSP) and Refined Consistency Interaction (RCI), augmented by a Mask-guided Interaction Module and a Bidirectional Consistency Constraint Loss to align tasks. By extending multimodal pretraining (e.g., BEiT-3) to a multi-task setting and employing explicit and implicit cross-task interactions, the approach achieves state-of-the-art results on RefCOCO/+/g with faster convergence. The proposed method significantly improves both localization and segmentation accuracy and demonstrates practical benefits in data-limited scenarios, highlighting the value of coarse-to-fine priors and cross-task supervision for robust vision-language grounding.

Abstract

Multi-task visual grounding performs localization and segmentation simultaneously in images based on textual expressions. Most advanced methods focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) makes joint prediction error-prone, leading to inconsistencies between multi-task predictions. In addition, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. First, query and pixel decoders generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are then refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
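To make the idea of cross-task consistency concrete, the sketch below scores how well a predicted box and a predicted binary mask agree with each other. It is an illustration only, not the paper's loss: the function names, the two penalty terms (one minus the IoU between the mask's tight bounding box and the predicted box, plus the fraction of mask pixels falling outside the box), and their unweighted sum are all assumptions made here. It is also a non-differentiable diagnostic in plain Python, whereas a training loss would have to operate on soft predictions.

```python
# Hypothetical box/mask agreement score; NOT the formulation used in the paper.
# Boxes are (x1, y1, x2, y2) in pixel indices with exclusive x2/y2.

def mask_bbox(mask):
    """Tight box around the foreground pixels of a 0/1 mask, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_to_box_penalty(mask, box):
    """Assumed term: 1 - IoU between the mask's tight box and the predicted box."""
    tight = mask_bbox(mask)
    return 1.0 if tight is None else 1.0 - box_iou(tight, box)


def box_to_mask_penalty(mask, box):
    """Assumed term: fraction of foreground pixels lying outside the predicted box."""
    x1, y1, x2, y2 = box
    total = outside = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                total += 1
                if not (x1 <= x < x2 and y1 <= y < y2):
                    outside += 1
    return outside / total if total else 0.0


def consistency_penalty(mask, box):
    """Unweighted sum of both directions; 0.0 means box and mask agree exactly."""
    return mask_to_box_penalty(mask, box) + box_to_mask_penalty(mask, box)


# A 2x2 foreground square inside a 4x4 image.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
aligned = consistency_penalty(mask, (1, 1, 3, 3))  # box matches the mask
shifted = consistency_penalty(mask, (0, 0, 2, 2))  # box shifted up and left
```

In this toy case the aligned box yields a penalty of zero, while the shifted box is penalized in both directions, which is the kind of disagreement between detection and segmentation outputs that the abstract describes.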

Paper Structure

This paper contains 31 sections, 16 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: (a) Three examples of inconsistent results between multi-task outputs. (b) Two examples of failure in identifying targets due to insufficient multi-modal understanding.
  • Figure 2: (a) Examples of the intermediate process in the proposed coarse-to-fine consistency constraint framework. (b) Two pretraining architectures: the left diagram illustrates separate encodings for image and text modalities followed by fusion, using single-modal pretraining; the right diagram shows a fused encoding architecture with multimodal pretraining.
  • Figure 3: The overall framework of the proposed $\text{C}^3\text{VG}$. First, the image and text features are fused and encoded using a multi-modality encoder. In the RSP stage, the pixel decoder and query decoder generate coarse segmentation and detection results. In the RCI stage, these multi-task priors are further refined through interaction and consistency constraints.
  • Figure 4: Architecture of the Mask-guided Interaction Module (MIM). "Coor. Embed" denotes a linear layer that maps coordinate positions into the hidden space.
  • Figure 5: Visualization of intermediate model processes. First row: original image, GT, RSP stage, and RCI stage results. Second row: original, box-constrained, mask-constrained, and unified-constrained heatmaps.
  • ...and 6 more figures