A Simple and Better Baseline for Visual Grounding

Jingchao Wang; Wenlong Zhang; Dingjiang Huang; Hong Wang; Yefeng Zheng

TL;DR

The paper addresses visual grounding by eliminating iterative, cache-heavy pipelines and introducing a parallel Transformer-based framework (FSVG) that jointly processes visual and linguistic information. A language-guided feature selection mechanism ranks and prunes visual tokens using a similarity-based score, reducing computation while preserving language-relevant features. Empirical results across four benchmarks show that FSVG achieves strong accuracy with higher efficiency and fewer parameters than many state-of-the-art methods, with an effective trade-off controlled by the feature-selection ratio $\rho$. The approach offers a practical, scalable baseline for real-time visual grounding with accessible code for replication.

Abstract

Visual grounding aims to predict the locations of target objects specified by textual descriptions. For this task, which involves both linguistic and visual modalities, a recent line of research selects only the language-relevant visual regions for object localization in order to reduce computational overhead. Although it achieves impressive performance, this approach is performed iteratively on different image scales, and at every iteration the linguistic and visual features must be stored in a cache, incurring extra overhead. To simplify implementation, we propose FSVG, a simple yet effective feature selection-based baseline for visual grounding. Specifically, we encapsulate the linguistic and visual modalities directly in a single network architecture without complicated iterative procedures, and use the language in parallel as guidance to facilitate interaction between the linguistic and visual modalities when extracting effective visual features. Furthermore, to reduce the computational cost, we introduce a similarity-based feature selection mechanism during visual feature learning that exploits only language-related visual features for faster prediction. Extensive experiments on several benchmark datasets show that FSVG achieves a better balance between accuracy and efficiency than current state-of-the-art methods. Code is available at https://github.com/jcwang0602/FSVG.
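
To make the idea of similarity-based feature selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that language tokens are mean-pooled into a single vector, that the score is cosine similarity, and that the selection ratio is the fraction of visual tokens kept; none of these choices is stated in the abstract. The names `cosine`, `select_visual_tokens`, and `rho` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_visual_tokens(visual_tokens, language_tokens, rho):
    """Return indices of the ceil(rho * N) visual tokens closest to the language.

    Assumptions (not taken from the paper): mean pooling of language tokens,
    cosine similarity as the score, and rho read as the fraction kept.
    Kept indices are returned in their original order.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    dim = len(language_tokens[0])
    # Pool the language tokens into one guidance vector.
    pooled = [
        sum(tok[d] for tok in language_tokens) / len(language_tokens)
        for d in range(dim)
    ]
    # Score every visual token against the guidance vector.
    scores = [cosine(tok, pooled) for tok in visual_tokens]
    n_keep = math.ceil(rho * len(visual_tokens))
    # Rank by score (highest first), keep the top n_keep, restore spatial order.
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy example: four 2-D visual tokens, two 2-D language tokens.
visual = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
language = [[1.0, 0.0], [1.0, 0.2]]
kept = select_visual_tokens(visual, language, rho=0.5)
print(kept)  # [0, 2]
```

Under this reading, a smaller `rho` discards more visual tokens and so reduces the work done by later layers, at the risk of dropping regions the description refers to. That is the accuracy and efficiency trade-off the abstract describes.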
