Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
TL;DR
This work tackles graph-based referring expression comprehension (REC) by addressing detector-induced noise and localization errors. It introduces a four-module framework: a language parser that produces sub-expressions, a bimodal graph attention module with visual and categorical graphs, a dynamic gate constraint (DGC) module that selectively prunes nodes and edges during iterative reasoning, and an expression-guided regression (EGR) strategy that refines the target bounding box. The DGC gates are guided by sub-expressions: they compute correlations between each sub-expression and the graph, and activate only the relevant graph components. EGR leverages both expression and graph features to improve localization. Across six REC datasets, the proposed approach achieves results competitive with or superior to state-of-the-art transformer-based models without any pre-training, highlighting the viability and efficiency of graph-based REC when equipped with DGC and EGR.
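The gating idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity score, the hard threshold, and the names `cosine` and `dynamic_gate` are all assumptions made for illustration, and a learned, differentiable gate would likely be used in practice. The sketch only shows how a sub-expression embedding could switch off irrelevant proposals and the edges attached to them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def dynamic_gate(node_feats, edges, sub_expr, threshold=0.5):
    """Hypothetical gate: keep nodes correlated with the sub-expression.

    node_feats: one feature vector per candidate object (graph node)
    edges:      list of (i, j) index pairs
    sub_expr:   embedding of the current sub-expression
    threshold:  assumed hard cut-off; not taken from the paper
    """
    scores = [cosine(f, sub_expr) for f in node_feats]
    active = {i for i, s in enumerate(scores) if s >= threshold}
    # An edge survives only if both of its endpoints are still active.
    kept_edges = [(i, j) for (i, j) in edges if i in active and j in active]
    return active, kept_edges, scores

# Toy example: three proposals, the third is unrelated to the sub-expression.
node_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = [(0, 1), (1, 2), (0, 2)]
sub_expr = [1.0, 0.0]

active, kept_edges, scores = dynamic_gate(node_feats, edges, sub_expr)
print(sorted(active), kept_edges)
```

In this toy case the third proposal is orthogonal to the sub-expression, so it is gated out together with both edges that touch it. Repeating the gate with a different sub-expression at each reasoning step would give the iterative pruning behaviour described in the summary.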
Abstract
One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization attributable to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.
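To make the localization-refinement idea concrete, the following minimal sketch shows one way a detector proposal could be adjusted by offsets predicted from expression and graph features. It is an assumption-laden illustration, not the EGR strategy itself: the linear head `predict_deltas`, its weights, and the center/size offset parameterization in `refine_box` (borrowed from common detector box regression) are hypothetical choices.

```python
import math

def predict_deltas(expr_feat, graph_feat, weights, bias):
    """Hypothetical linear head mapping concatenated features to 4 box offsets."""
    x = expr_feat + graph_feat  # list concatenation of the two feature vectors
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def refine_box(box, deltas):
    """Apply (dx, dy, dw, dh) offsets to a (cx, cy, w, h) box.

    Center shifts are scaled by box size; width and height are scaled
    exponentially so they always stay positive.
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

# Toy inputs; all values are made up for illustration.
expr_feat = [0.5, -0.5]
graph_feat = [1.0, 0.0]
weights = [
    [0.0, 0.0, 0.1, 0.0],  # dx depends on the first graph feature
    [0.0, 0.0, 0.0, 0.0],  # dy
    [0.0, 0.0, 0.0, 0.0],  # dw
    [0.0, 0.0, 0.0, 0.0],  # dh
]
bias = [0.0, 0.0, 0.0, 0.0]

box = (50.0, 40.0, 20.0, 10.0)  # detector proposal as (cx, cy, w, h)
deltas = predict_deltas(expr_feat, graph_feat, weights, bias)
refined = refine_box(box, deltas)
print(refined)
```

Here the proposal's center moves right by one tenth of its width while its size is unchanged. The point of the sketch is only that the final box need not coincide with the detector's proposal once the expression is allowed to influence regression.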
