Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

Yaxian Wang; Henghui Ding; Shuting He; Xudong Jiang; Bifan Wei; Jun Liu

TL;DR

This work tackles Generalized Referring Expression Comprehension (GREC), which requires detecting arbitrary numbers of target objects, including zero targets, from free-form text. It introduces HieA2G, a Hierarchical Alignment-enhanced Adaptive Grounding Network that combines a Hierarchical Multi-modal Semantic Alignment (HMSA) module with an Adaptive Grounding Counter (AGC). HMSA enables word-object, phrase-object, and text-image alignments, aided by a text-mask recovery task and a phrase-object contrastive objective, while AGC dynamically predicts the number of outputs and employs a memory-augmented contrastive loss to improve object counting. Pretraining on merged datasets followed by finetuning on downstream tasks yields state-of-the-art results for GREC and strong performance on REC, phrase grounding, RES, and GRES, demonstrating strong generalizability and practical impact in flexible visual grounding scenarios.

Abstract

In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to classic Referring Expression Comprehension (REC), which focuses on single-target expressions, GREC extends the scope to a more practical setting by also encompassing no-target and multi-target expressions. Existing REC methods struggle with the complex cases encountered in GREC, primarily because of their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly handle various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module incorporates three levels of alignment: word-object, phrase-object, and text-image. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) that dynamically determines the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling together multi-modal features with the same target count and pushing apart those with different counts. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task as well as on four other tasks: REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the superiority and generalizability of the proposed HieA2G.
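
The count-based contrastive idea in the abstract (features with the same number of targets attract each other, features with different numbers repel) can be illustrated with a minimal sketch. This is not the paper's formulation, which is not given in this summary. It is a generic supervised-contrastive loss in which the "label" of each sample is its target count. The function name `count_contrastive_loss`, the `temperature` parameter, and the use of in-batch samples only (no memory bank) are assumptions made for illustration.

```python
import math


def count_contrastive_loss(features, counts, temperature=0.1):
    """Supervised-contrastive-style loss keyed on target counts (illustrative).

    features: list of equal-length float vectors, one per sample.
    counts:   list of ints, the number of target objects for each sample.
    Samples sharing a count are positives; all others are negatives.
    Hypothetical helper, not the paper's implementation.
    """
    # L2-normalise so that dot products become cosine similarities.
    normed = []
    for vec in features:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and counts[j] == counts[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of picking a positive.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1

    return total / num_anchors if num_anchors else 0.0


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    # Counts that agree with the feature clusters give a lower loss
    # than counts that cut across them.
    print(count_contrastive_loss(feats, [1, 1, 2, 2]))
    print(count_contrastive_loss(feats, [1, 2, 1, 2]))
```

In this sketch the loss is small when samples with the same count already lie close together in feature space and large when they do not, which is the behavior the auxiliary objective is described as encouraging.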

Paper Structure (16 sections, 10 equations, 4 figures, 6 tables)

Introduction
Related Work
Methodology
Architecture Overview
Hierarchical Multi-modal Semantic Alignment
Word-Object Alignment.
Phrase-Object Alignment.
Text-Image Alignment.
Adaptive Grounding Counter
Training Objective
Experiments
Experimental Setup
Performance Comparison
Ablation Study
Qualitative Analysis
...and 1 more section

Figures (4)

Figure 1: Different visual grounding tasks. (a) Classic REC: a text expression can specify only a single object; (b) Phrase grounding: detects all objects mentioned in the expression; (c) GREC [he2023grec, liu2023gres]: supports text expressions that indicate an arbitrary number of target objects, from zero to multiple, which makes it a more challenging task.
Figure 2: The framework of our proposed HieA2G. First, the visual encoder and the text encoder extract the visual feature $V_I$ and the text feature $T_w$. Then, a Transformer encoder performs further multi-modal feature interaction. The learnable object queries and the output of the Transformer encoder are fed to the Transformer decoder, which outputs object embeddings $\mathcal{O}_e$ corresponding to the object queries. Next, based on $\mathcal{O}_e$, the Hierarchical Multi-modal Semantic Alignment (HMSA) module facilitates multi-level cross-modal interaction via word-object, phrase-object, and text-image alignment. Finally, an Adaptive Grounding Counter (AGC) dynamically decides the number of output target objects.
Figure 3: Details of the Adaptive Grounding Counter.
Figure 4: Visualization of success and failure cases of HieA2G on the gRefCOCO dataset. Red bounding boxes denote the ground truth and green bounding boxes denote the predictions. The $\mathrm{F_1}$ score of all success cases in (A) is 1.0.
