Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
Yaxian Wang, Henghui Ding, Shuting He, Xudong Jiang, Bifan Wei, Jun Liu
TL;DR
This work tackles Generalized Referring Expression Comprehension (GREC), which requires detecting an arbitrary number of target objects, including none, from free-form text. It introduces HieA2G, a Hierarchical Alignment-enhanced Adaptive Grounding Network that combines a Hierarchical Multi-modal Semantic Alignment (HMSA) module with an Adaptive Grounding Counter (AGC). HMSA performs word-object, phrase-object, and text-image alignment, aided by a text-mask recovery task and a phrase-object contrastive objective, while AGC dynamically predicts the number of outputs and uses a memory-augmented contrastive loss to improve object counting. Pretraining on merged datasets followed by finetuning on downstream tasks yields state-of-the-art results on GREC and strong performance on classic Referring Expression Comprehension (REC), phrase grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), indicating that the approach generalizes across visual grounding settings.
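To make the idea of a count-driven output concrete, the following is a minimal sketch of how a predicted target count could gate the final set of boxes. It is not the authors' implementation: the function name `select_by_count`, the box format, and the top-k selection rule are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def select_by_count(
    boxes: List[Box], scores: List[float], predicted_count: int
) -> List[Box]:
    """Keep the `predicted_count` highest-scoring boxes.

    Assumption: a separate counter supplies `predicted_count`. A count of
    zero models a no-target expression and yields an empty list.
    """
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    # Clamp the count to the number of available candidates.
    k = max(0, min(predicted_count, len(boxes)))
    # Rank candidate indices by descending score.
    ranked = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in ranked[:k]]
```

With this rule, a no-target expression (count 0) returns no boxes, a single-target expression returns one, and a multi-target expression returns as many as the counter predicts.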
Abstract
In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared with classic Referring Expression Comprehension (REC), which focuses on single-target expressions, GREC extends the scope to a more practical setting that also covers no-target and multi-target expressions. Existing REC methods struggle with the complex cases encountered in GREC, primarily because of their fixed-size output and the limitations of their multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly handle various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module incorporates three levels of alignment: word-object, phrase-object, and text-image. It enables hierarchical cross-modal interactions across these levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability in complex cases. Then, to handle the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) that dynamically determines the number of output targets. An auxiliary contrastive loss in AGC further enhances object-counting ability by pulling together multi-modal features with the same target count and pushing apart those with different counts. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task as well as on four other tasks: REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the superiority and generalizability of the proposed HieA2G.
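The counting objective described above can be illustrated with a supervised-contrastive-style loss in which samples sharing a target count act as positives. This is only a sketch under stated assumptions, not the paper's exact formulation: the function name `count_contrastive_loss`, the cosine similarity, the temperature value, and the in-batch-only comparison (no memory bank) are all illustrative choices.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """L2-normalize a vector; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def count_contrastive_loss(
    features: Sequence[Sequence[float]],
    counts: Sequence[int],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Average contrastive loss over anchors that have at least one positive.

    Positives for an anchor are the other samples with the same target count;
    all remaining samples act as negatives.
    """
    feats = [_normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and counts[j] == counts[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += sum(log_denom - logits[j] for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The loss is small when features with equal counts are close to each other and far from features with other counts, and it grows when features with different counts are mixed together.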
