Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression

Jingcheng Ke; Dele Wang; Jun-Cheng Chen; I-Hong Jhuo; Chia-Wen Lin; Yen-Yu Lin

TL;DR

This work tackles graph-based referring expression comprehension (REC) by addressing two weaknesses inherited from off-the-shelf detectors: noise from irrelevant proposals and inaccurate localization. It introduces a four-module framework: a language parser that produces sub-expressions, a bimodal graph attention module with visual and categorical graphs, a dynamic gate constraint (DGC) that selectively prunes nodes and edges during iterative reasoning, and an expression-guided regression (EGR) strategy that refines the target bounding box. The DGC gates are guided by sub-expressions: they compute correlations and activate only the relevant graph components, while EGR leverages both expression and graph features to improve localization. Across six REC datasets, the proposed approach achieves results competitive with or superior to state-of-the-art transformer-based models without any pre-training, highlighting the viability and efficiency of graph-based REC when equipped with DGC and EGR.

Abstract

One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributable to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.
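As a rough illustration of what "disabling irrelevant proposals and their connections" can mean on a graph, the sketch below gates nodes by their similarity to a sub-expression feature and keeps only the edges whose two endpoints remain active. It is a minimal toy under stated assumptions, not the paper's DGC: the hard cosine-similarity threshold, the function names `cosine` and `gate_graph`, and the 2-D feature vectors are all hypothetical choices made here for illustration, and the actual gating function is not specified in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gate_graph(node_feats, edges, sub_expr_feat, threshold=0.5):
    """Toy gating step (assumption: hard threshold on cosine similarity).

    node_feats    -- one feature vector per candidate proposal
    edges         -- list of (i, j) index pairs connecting proposals
    sub_expr_feat -- feature vector of the current sub-expression
    Returns (node_gates, active_edges).
    """
    # A node stays active only if it correlates with the sub-expression.
    node_gates = [cosine(f, sub_expr_feat) >= threshold for f in node_feats]
    # An edge stays active only if both of its endpoints are active.
    active_edges = [(i, j) for (i, j) in edges if node_gates[i] and node_gates[j]]
    return node_gates, active_edges


# Three hypothetical proposals; the third is unrelated to the sub-expression.
nodes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = [(0, 1), (1, 2), (0, 2)]
sub_expr = [1.0, 0.0]

gates, kept = gate_graph(nodes, edges, sub_expr)
print(gates)  # [True, True, False]
print(kept)   # [(0, 1)]
```

In this toy, deactivating the third proposal also removes both edges that touch it, so later reasoning steps would only propagate information between the two remaining candidates.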

Paper Structure (22 sections, 15 equations, 6 figures, 7 tables)

Introduction
Related Works
Proposed Method
Overview of Our Method
Language Parser
Bimodal Graph Attention Module
Node Features
Edge Features
Gating States
Reasoning with Dynamic Gate Constraints
Dynamic Gate Constraint (DGC)
Reasoning on Bimodal Sub-Graphs
Node Weights:
Edge Weights:
Matching and Expression-guided Regression
...and 7 more sections

Figures (6)

Figure 1: We employ a language parser to decompose a given expression into sub-expressions, each of which is related to a few objects. This helps the DGC module deactivate irrelevant objects during reasoning. An expression-guided regression (EGR) strategy is devised to refine the target object location. The red, blue, and green boxes denote the proposals, the proposal best matching the ground truth, and the refined REC output, respectively.
Figure 2: Overview of our graph-based REC framework with sub-expression-guided DGC. (a) The flow of adaptively activating relevant candidate objects during reasoning. Bigger nodes in graphs denote larger weights. (b) The DGC module. (c) The EGR module, where $\textbf{C}$ represents the concatenation operation. $v_{i}^{a}$ represents the visual feature of the node with the highest score in the visual graph, while $v_{i}^{c}$ represents the corresponding categorical feature in the categorical graph. $\textbf{q}$ represents the text feature of the entire expression. Refer to the main text for more details.
Figure 3: Illustration of the reasoning process of our method. As the reasoning steps increase, more sub-expressions are extracted for guidance, following a specific order. In the second and third reasoning steps, our method correctly locates the target object. Finally, we utilize the EGR strategy to refine the predicted bounding box.
Figure 4: Visualizations of the proposed method in different cases. (a), (b) and (c) show the results of our model without EGR, our model without DGC, and our network, respectively.
Figure 5: Qualitative results of the proposed method. The first, second, and third rows show the results of our method without DGC and EGR, our method, and the processing sequence of sub-expressions, respectively. The words in blue represent the currently processed sub-expression.
...and 1 more figure
