ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Wei Su, Peihan Miao, Huanzhang Dou, Xi Li

TL;DR

ScanFormer tackles the inefficiency of dense perception in Referring Expression Comprehension by introducing a coarse-to-fine iterative perception framework that traverses an image scale pyramid and discards linguistically irrelevant visual regions. It employs a unified vision-language Transformer with a multi-scale patch cache and a patch-selection mechanism that replaces discarded patches with a constant token and merges them to reduce compute, while predicting target boxes via per-scale [REG] tokens. Empirical results on RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame demonstrate competitive accuracy and real-time inference, outperforming several baselines and achieving strong efficiency gains. The work highlights the potential of iterative, scale-aware perception for vision-language tasks and paves the way for more flexible, efficient grounding systems.

Abstract

Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform dense perception of images, which incorporates redundant visual regions unrelated to the linguistic query, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistically irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It iteratively exploits the image scale pyramid to extract linguistically relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our informativeness prediction. Furthermore, we propose a patch selection strategy for the discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which strikes a balance between accuracy and efficiency.
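
To make the iterative scan concrete, here is a minimal PyTorch sketch of the coarse-to-fine loop: score patch informativeness, replace discarded patches with a shared constant token, and predict a box from the [REG] token at every scale. All names (encoder1, encoder2, inform_head, box_head, const_token), the sigmoid threshold, and the quadtree child layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ScanFormerSketch(nn.Module):
    """Hypothetical sketch of coarse-to-fine iterative patch selection."""

    def __init__(self, encoder1: nn.Module, encoder2: nn.Module, dim: int):
        super().__init__()
        self.encoder1 = encoder1              # first half: scores patch informativeness
        self.encoder2 = encoder2              # second half: refines features, predicts box
        self.inform_head = nn.Linear(dim, 1)  # per-patch informativeness logit
        self.box_head = nn.Linear(dim, 4)     # [REG] token -> (cx, cy, w, h)
        self.const_token = nn.Parameter(torch.zeros(1, dim))  # stands in for discards

    def forward(self, text, pyramid, reg, threshold=0.5):
        # `pyramid`: per-scale patch embeddings, coarse -> fine, each of shape
        # (num_patches_at_scale, dim); a quadtree (Z-order) layout is assumed,
        # so the 4 children of a coarse patch are contiguous at the next scale.
        boxes = []
        keep = torch.ones(pyramid[0].size(0), dtype=torch.bool)
        for level, patches in enumerate(pyramid):
            # All discarded patches share one constant token, so they merge
            # into a single sequence position instead of many.
            visual = torch.cat([patches[keep], self.const_token], dim=0)
            tokens = torch.cat([reg, text, visual], dim=0).unsqueeze(0)
            hidden = self.encoder1(tokens)
            # Informativeness prediction over the surviving patches only.
            n_ctx, n_kept = reg.size(0) + text.size(0), int(keep.sum())
            logits = self.inform_head(hidden[0, n_ctx:n_ctx + n_kept, :])
            selected = torch.sigmoid(logits.squeeze(-1)) > threshold
            hidden = self.encoder2(hidden)
            boxes.append(self.box_head(hidden[0, 0]))  # per-scale [REG] box
            if level + 1 < len(pyramid):
                grid = torch.zeros_like(keep)
                grid[keep] = selected             # map decisions back to the full grid
                keep = grid.repeat_interleave(4)  # expand each kept patch to 4 children
        return boxes
```

Because every discarded patch maps to the same constant embedding, the discards collapse into a single token; that shrinking sequence length is where the compute savings come from.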

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Comparison of dense perception and coarse-to-fine iterative perception. Dense perception extracts features by traversing the whole image with sliding windows or non-overlapping patches. In contrast, our iterative perception identifies and discards linguistically irrelevant redundant regions from coarse to fine scales.
  • Figure 2: The overall architecture of ScanFormer. The text inputs and the image patches at each scale share the encoder. The outputs of the first half of the encoder, i.e., Encoder1, are used to select finer-grained patches for the next level. The [REG] tokens output by the second half of the encoder, i.e., Encoder2, are used to predict the coordinates of the referred object at the corresponding scale. The key and value features generated in the encoder are cached and propagated from left to right (a minimal sketch of this caching pattern follows the figure list).
  • Figure 3: Token interaction across modalities and scales. Dark cells indicate blocked interactions. Regions enclosed by blue dotted lines represent the interaction within each iteration, and regions enclosed by orange dotted lines represent the interaction with the K&V cache.
  • Figure 4: Comparison of performance and inference speed on the val set of RefCOCO+. The real-time speed threshold is set to 25 FPS, and all inference speeds are measured on a GTX 1080 Ti.
  • Figure 5: Acc@0.5 and IoU between predicted bounding boxes and ground truth at each of the three scales, evaluated on the val set of RefCOCOg (UMD split).
  • ...and 3 more figures
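
Figures 2 and 3 together describe a key/value cache: each iteration's K and V features are stored and re-attended by later, finer-scale iterations, so coarse-scale context is reused rather than recomputed. Below is a hedged, self-contained sketch of that pattern; the function name, the tensor shapes, and the bare scaled_dot_product_attention call are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F


def attend_with_cache(q, k, v, cache):
    # `cache` holds (K, V) pairs from all earlier (coarser) iterations.
    # Current-scale queries attend over cached plus current keys/values,
    # matching the left-to-right K&V propagation shown in Figure 2.
    ks = [ck for ck, _ in cache] + [k]
    vs = [cv for _, cv in cache] + [v]
    out = F.scaled_dot_product_attention(
        q, torch.cat(ks, dim=1), torch.cat(vs, dim=1)
    )
    cache.append((k, v))  # expose this scale's K/V to finer scales
    return out


# Toy usage: three scales with 4x token growth per iteration.
cache = []
for n in (4, 16, 64):
    x = torch.randn(1, n, 64)  # (batch, tokens, dim)
    out = attend_with_cache(x, x, x, cache)
```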