
Referencing Where to Focus: Improving Visual Grounding with Referential Query

Yabing Wang, Zhuotao Tian, Qingpei Guo, Zheng Qin, Sanping Zhou, Ming Yang, Le Wang

TL;DR

This work addresses visual grounding with DETR-like architectures, where target-agnostic queries (randomly initialized or derived from linguistic embeddings) and reliance on only the deepest image features make precise localization harder to learn. It introduces RefFormer, a DETR-inspired framework that inserts a Query Adaptation (QA) module into the CLIP backbone to generate referential queries, which carry target-related context and multi-level visual information into the decoder. The QA module also acts as an adapter, so CLIP's knowledge is preserved without tuning the backbone parameters. Key contributions are the QA-based referential query mechanism, language-guided multi-level fusion in decoding, and benchmarking on five datasets, where the method reports state-of-the-art results with efficient training.

Abstract

Visual grounding aims to localize the referred object in an image given a natural language expression. Recent DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional steps such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or from linguistic embeddings. This vanilla query generation approach increases the learning difficulty for the model, as it involves no target-related information at the beginning of decoding. Furthermore, these methods use only the deepest image feature during query learning, overlooking the importance of features from other levels. To address these issues, we propose a novel approach called RefFormer. It consists of a query adaption module, which can be seamlessly integrated into CLIP and generates referential queries that provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential queries into the decoder, we reduce the decoder's learning difficulty and help it concentrate accurately on the target object. Additionally, the proposed query adaption module can act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, which outperforms state-of-the-art approaches on five visual grounding benchmarks.
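
The idea of a referential query can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the internals of the query adaption module, so the sketch assumes plain single-head cross-attention with no learned projections, and every name in it (`cross_attend`, `referential_queries`, the feature shapes) is hypothetical. It shows only the overall flow: queries start random, pick up text and image context at several backbone levels, and are then ready to be handed to a decoder instead of the random queries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    # Single-head scaled dot-product attention, no learned projections
    # (an assumption made to keep the sketch short).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale)  # (num_queries, num_context)
    return weights @ context, weights

def referential_queries(init_queries, text_feats_per_layer, image_feats_per_layer):
    # Refine the queries level by level so they accumulate
    # language-conditioned, multi-level visual context.
    q = init_queries
    for text_feats, image_feats in zip(text_feats_per_layer, image_feats_per_layer):
        text_ctx, _ = cross_attend(q, text_feats)    # gather language context
        q = q + text_ctx
        image_ctx, _ = cross_attend(q, image_feats)  # gather visual context at this level
        q = q + image_ctx
    return q

# Toy features standing in for backbone outputs at three levels (hypothetical sizes).
rng = np.random.default_rng(0)
dim, num_queries, num_tokens, num_patches, num_layers = 16, 3, 5, 49, 3
init_q = rng.normal(size=(num_queries, dim))  # target-agnostic starting queries
text_layers = [rng.normal(size=(num_tokens, dim)) for _ in range(num_layers)]
image_layers = [rng.normal(size=(num_patches, dim)) for _ in range(num_layers)]

ref_q = referential_queries(init_q, text_layers, image_layers)
print(ref_q.shape)  # (3, 16)
```

In this toy setup the backbone features are never modified, which mirrors the adapter-style role described in the abstract: only the queries change, while the pre-trained features stay fixed.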

Paper Structure

This paper contains 23 sections, 16 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Comparison of a DETR-like method and our proposed method for visual grounding. (a) Existing methods typically feed randomly initialized queries directly into the decoder to predict the target object. (b) We introduce the query adaption (QA) module to learn target-related context progressively, providing valuable prior knowledge for the decoder. (c) Attention maps of the last layer in each QA module and in the decoder (bottom).
  • Figure 2: Overview of RefFormer. It adopts a DETR-like structure, consisting of a query adaptation (QA) module that seamlessly integrates into various layers of CLIP, along with a task-specific decoder. By incorporating the QA module, RefFormer can iteratively refine the target-related context and generate referential queries, which provide the decoder with prior context.
  • Figure 3: Illustration of our proposed Query Adaption Module, which mainly consists of CAMF and TR modules that generate the referential queries and promote multi-modal feature interaction. "R" represents the feature modulation.
  • Figure 4: Ablation studies of backbone, auxiliary loss, and learnable queries on RefCOCOg.
  • Figure 5: Convergence curves. Our method achieves better results with fewer training epochs on RefCOCOg.
  • ...and 3 more figures