ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Wei Su, Peihan Miao, Huanzhang Dou, Xi Li

TL;DR

ScanFormer tackles the inefficiency of dense perception in Referring Expression Comprehension by introducing a coarse-to-fine iterative perception framework that traverses an image scale pyramid and discards linguistically irrelevant visual regions. It employs a unified vision-language Transformer with a multi-scale patch cache and a patch-selection mechanism that replaces discarded patches with a constant token and merges them to reduce compute, while predicting target boxes via per-scale [REG] tokens. Empirical results on RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame demonstrate competitive accuracy and real-time inference, outperforming several baselines and achieving strong efficiency gains. The work highlights the potential of iterative, scale-aware perception for vision-language tasks and paves the way for more flexible, efficient grounding systems.

Abstract

Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform dense perception of images, which incorporates redundant visual regions unrelated to the linguistic query, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistically irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It iteratively exploits the image scale pyramid to extract linguistically relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our informativeness prediction. Furthermore, we propose a patch selection strategy for the discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which strikes a balance between accuracy and efficiency.
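
To make the iterative scan concrete, here is a minimal PyTorch sketch of the coarse-to-fine loop: score patch informativeness, replace discarded patches with a shared constant token, and predict a box from the [REG] token at every scale. All names (encoder1, encoder2, inform_head, box_head, const_token), the sigmoid threshold, and the quadtree child layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ScanFormerSketch(nn.Module):
    """Hypothetical sketch of coarse-to-fine iterative patch selection."""

    def __init__(self, encoder1: nn.Module, encoder2: nn.Module, dim: int):
        super().__init__()
        self.encoder1 = encoder1              # first half: scores patch informativeness
        self.encoder2 = encoder2              # second half: refines features, predicts box
        self.inform_head = nn.Linear(dim, 1)  # per-patch informativeness logit
        self.box_head = nn.Linear(dim, 4)     # [REG] token -> (cx, cy, w, h)
        self.const_token = nn.Parameter(torch.zeros(1, dim))  # stands in for discards

    def forward(self, text, pyramid, reg, threshold=0.5):
        # `pyramid`: per-scale patch embeddings, coarse -> fine, each of shape
        # (num_patches_at_scale, dim); a quadtree (Z-order) layout is assumed,
        # so the 4 children of a coarse patch are contiguous at the next scale.
        boxes = []
        keep = torch.ones(pyramid[0].size(0), dtype=torch.bool)
        for level, patches in enumerate(pyramid):
            # All discarded patches share one constant token, so they merge
            # into a single sequence position instead of many.
            visual = torch.cat([patches[keep], self.const_token], dim=0)
            tokens = torch.cat([reg, text, visual], dim=0).unsqueeze(0)
            hidden = self.encoder1(tokens)
            # Informativeness prediction over the surviving patches only.
            n_ctx, n_kept = reg.size(0) + text.size(0), int(keep.sum())
            logits = self.inform_head(hidden[0, n_ctx:n_ctx + n_kept, :])
            selected = torch.sigmoid(logits.squeeze(-1)) > threshold
            hidden = self.encoder2(hidden)
            boxes.append(self.box_head(hidden[0, 0]))  # per-scale [REG] box
            if level + 1 < len(pyramid):
                grid = torch.zeros_like(keep)
                grid[keep] = selected             # map decisions back to the full grid
                keep = grid.repeat_interleave(4)  # expand each kept patch to 4 children
        return boxes
```

Because every discarded patch maps to the same constant embedding, the discards collapse into a single token; that shrinking sequence length is where the compute savings come from.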

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Comparison of dense perception and coarse-to-fine iterative perception. Dense perception extracts features by traversing the whole image with sliding windows or non-overlapping patches. In contrast, our iterative perception identifies and discards linguistically irrelevant redundant regions from coarse to fine scales.
  • Figure 2: The overall architecture of ScanFormer. The text inputs and the image patches at each scale share the encoder. The outputs of the first half of the encoder, i.e., Encoder1, are used to select finer-grained patches for the next level. The [REG] tokens output by the second half of the encoder, i.e., Encoder2, are used to predict the coordinates of the referred object at the corresponding scale. The key and value features generated in the encoder are cached and propagated from left to right (a minimal sketch of this caching pattern follows the figure list).
  • Figure 3: Token interaction across modalities and scales. Dark cells indicate blocked interactions. Regions enclosed by blue dotted lines represent the interaction within each iteration, and regions enclosed by orange dotted lines represent the interaction with the K&V cache.
  • Figure 4: Comparison of performance and inference speed on the val set of RefCOCO+. The real-time speed threshold is set to 25 FPS, and all inference speeds are measured on a GTX 1080 Ti.
  • Figure 5: Acc@0.5 and IoU between predicted bounding boxes and ground truth at each of the three scales, evaluated on the val set of RefCOCOg (UMD split).
  • ...and 3 more figures
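
Figures 2 and 3 together describe a key/value cache: each iteration's K and V features are stored and re-attended by later, finer-scale iterations, so coarse-scale context is reused rather than recomputed. Below is a hedged, self-contained sketch of that pattern; the function name, the tensor shapes, and the bare scaled_dot_product_attention call are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F


def attend_with_cache(q, k, v, cache):
    # `cache` holds (K, V) pairs from all earlier (coarser) iterations.
    # Current-scale queries attend over cached plus current keys/values,
    # matching the left-to-right K&V propagation shown in Figure 2.
    ks = [ck for ck, _ in cache] + [k]
    vs = [cv for _, cv in cache] + [v]
    out = F.scaled_dot_product_attention(
        q, torch.cat(ks, dim=1), torch.cat(vs, dim=1)
    )
    cache.append((k, v))  # expose this scale's K/V to finer scales
    return out


# Toy usage: three scales with 4x token growth per iteration.
cache = []
for n in (4, 16, 64):
    x = torch.randn(1, n, 64)  # (batch, tokens, dim)
    out = attend_with_cache(x, x, x, cache)
```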