Generalizable Entity Grounding via Assistance of Large Language Model
Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang
TL;DR
GELLA tackles dense grounding of entities described by long captions by coupling a colormap-based mask representation with a CLIP image encoder and an LLM-driven noun extractor. The framework introduces ResoBlend to fuse mask priors with image features and an association module to align semantic nouns with entity regions, enabling efficient multi-entity grounding without a high-resolution image encoder. In experiments on panoptic narrative grounding, referring expression segmentation, and panoptic segmentation, GELLA achieves state-of-the-art or competitive results at reduced computational cost, and it remains flexible enough to absorb outputs from diverse segmentation models. The results indicate that offline masks and lightweight encoders, guided by an LLM, can ground long captions in complex scenes, with practical implications for interactive AI systems and multimodal analysis.
Abstract
In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, which preserves fine-grained predictions from the features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, which outperforms state-of-the-art techniques on three tasks: panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
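To make the colormap idea concrete, the following is a minimal sketch of how a set of binary entity masks could be packed into a single RGB image and recovered again. It is an illustration under stated assumptions, not the authors' implementation: the function names (`make_palette`, `encode_colormap`, `decode_colormap`), the evenly spaced color assignment, and the choice of black as the background color are all hypothetical and do not come from the abstract.

```python
import numpy as np


def make_palette(num_entities):
    """Return `num_entities` distinct RGB colors as a (N, 3) uint8 array.

    Assumption: black (0, 0, 0) is reserved for background, so color
    codes start at 1. Codes are spread evenly over the 24-bit RGB range.
    """
    codes = np.linspace(1, 256**3 - 1, num_entities).astype(np.int64)
    channels = [(codes >> 16) & 255, (codes >> 8) & 255, codes & 255]
    return np.stack(channels, axis=1).astype(np.uint8)


def encode_colormap(masks, palette):
    """Paint boolean masks of shape (N, H, W) into one (H, W, 3) image."""
    _, height, width = masks.shape
    colormap = np.zeros((height, width, 3), dtype=np.uint8)
    for mask, color in zip(masks, palette):
        # Where masks overlap, the later entity overwrites the earlier one.
        colormap[mask] = color
    return colormap


def decode_colormap(colormap, palette):
    """Recover one boolean (H, W) mask per palette color."""
    return np.stack([np.all(colormap == color, axis=-1) for color in palette])


if __name__ == "__main__":
    # Two non-overlapping toy entities on a 4x4 grid.
    masks = np.zeros((2, 4, 4), dtype=bool)
    masks[0, :2, :2] = True
    masks[1, 2:, 2:] = True

    palette = make_palette(len(masks))
    colormap = encode_colormap(masks, palette)
    recovered = decode_colormap(colormap, palette)
    print(colormap.shape, np.array_equal(recovered, masks))
```

In this sketch the colormap has the same spatial size as the masks, so all entity masks travel in a single image-like tensor. How the colormap is then fused with the low-resolution CLIP features is not specified in the abstract and is not modeled here.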
