A Simple and Better Baseline for Visual Grounding

Jingchao Wang, Wenlong Zhang, Dingjiang Huang, Hong Wang, Yefeng Zheng

TL;DR

The paper addresses visual grounding by eliminating iterative, cache-heavy pipelines and introducing a parallel Transformer-based framework (FSVG) that jointly processes visual and linguistic information. A language-guided feature selection mechanism ranks and prunes visual tokens using a similarity-based score, reducing computation while preserving language-relevant features. Empirical results across four benchmarks show that FSVG achieves strong accuracy with higher efficiency and fewer parameters than many state-of-the-art methods, with an effective trade-off controlled by the feature-selection ratio $\rho$. The approach offers a practical, scalable baseline for real-time visual grounding with accessible code for replication.

Abstract

Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task, which involves both linguistic and visual modalities, a recent line of research selects only the language-relevant visual regions for object localization to reduce computational overhead. Although this approach achieves impressive performance, it runs iteratively over different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify the implementation, we propose FSVG, a simple yet effective feature selection-based baseline for visual grounding. Specifically, we directly encapsulate the linguistic and visual modalities in a single network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate interaction between the two modalities for extracting effective visual features. Furthermore, to reduce the computational cost, we introduce a similarity-based feature selection mechanism during visual feature learning that exploits only language-related visual features for faster prediction. Extensive experiments on several benchmark datasets show that FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
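
The similarity-based selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_visual_tokens`, the use of the mean language token as the query, and cosine similarity as the score are all assumptions made for illustration. Only the general idea (score visual tokens against the language, keep a fraction $\rho$ of them) comes from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_visual_tokens(visual_tokens, language_tokens, rho):
    """Keep the ceil(rho * N) visual tokens most similar to the language query.

    Assumptions (hypothetical, not taken from the paper): the language query
    is the mean of the language token vectors, and the score is cosine
    similarity. The scoring used by FSVG may differ.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    dim = len(language_tokens[0])
    # Pool the language tokens into a single query vector.
    query = [
        sum(tok[d] for tok in language_tokens) / len(language_tokens)
        for d in range(dim)
    ]
    scores = [cosine(tok, query) for tok in visual_tokens]
    keep = max(1, math.ceil(rho * len(visual_tokens)))
    # Rank by descending score, then restore the original token order.
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return kept_indices, [visual_tokens[i] for i in kept_indices]


# Toy example: four 2-D visual tokens, two language tokens, rho = 0.5.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
language = [[1.0, 0.0], [1.0, 0.2]]
kept_indices, kept_tokens = select_visual_tokens(visual, language, rho=0.5)
print(kept_indices)  # [0, 2]
```

Lowering `rho` keeps fewer visual tokens, so later layers process a shorter sequence. That shorter sequence is the source of the speed-up the abstract refers to, at the risk of discarding regions the prediction needs.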

Paper Structure

This paper contains 10 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of accuracy and efficiency on the widely-adopted RefCOCO val set. The circle size is proportional to the number of model parameters. As seen, our FSVG strikes a better balance between performance and inference speed with comparable model parameters. $\rho$ denotes the ratio of visual feature selection. The lower the value, the fewer visual features are selected for faster prediction.
  • Figure 2: Comparisons of different pipelines for visual grounding.
  • Figure 3: The overall architecture of the proposed FSVG, which directly takes the concatenation of visual tokens and linguistic tokens as input and consists of alternating vanilla Transformer layers and visual feature selection-based Transformer layers for faster localization prediction.
  • Figure 4: Diagram of Language-guided Visual Feature Selection.
  • Figure 5: Visualization of language-guided visual feature selection on CLIP-ViT-B based on the RefCOCO val set. For each input image, the red bounding box is the ground truth and the green box is the prediction of our proposed FSVG ($\rho=0.7$). The black patches are the discarded regions, as determined by our proposed language-guided visual feature selection process.
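
A visualization like the one described for Figure 5 needs a mapping from the indices of the kept visual tokens back to the patch grid. The sketch below shows one way to build such a mask. The helper names `discarded_patch_mask` and `render` are hypothetical, and the row-major ordering of patch tokens is an assumption, not something stated in the captions.

```python
def discarded_patch_mask(kept_indices, grid_height, grid_width):
    """Return a grid of booleans that is True where a patch was discarded.

    Assumption: patch tokens are indexed in row-major order, so token i
    sits at row i // grid_width and column i % grid_width.
    """
    total = grid_height * grid_width
    kept = set(kept_indices)
    if any(i < 0 or i >= total for i in kept):
        raise ValueError("kept index outside the patch grid")
    return [
        [(row * grid_width + col) not in kept for col in range(grid_width)]
        for row in range(grid_height)
    ]


def render(mask):
    """Draw discarded patches as '#' and kept patches as '.'."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


# Toy example: a 2 x 3 patch grid where tokens 0, 1 and 4 were kept.
mask = discarded_patch_mask([0, 1, 4], grid_height=2, grid_width=3)
print(render(mask))
# ..#
# #.#
```

In an actual figure, each `True` cell would be painted black over the corresponding image patch; the text rendering here only checks that the index-to-grid mapping is right.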