Generalizable Entity Grounding via Assistance of Large Language Model

Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang

TL;DR

GELLA tackles dense grounding of entities described by long captions by coupling a colormap-based mask representation with a CLIP image encoder and an LLM-driven noun extractor. The framework introduces ResoBlend to fuse mask priors with image features and an association module to align semantic nouns with entity regions, enabling efficient, multi-entity grounding without requiring high-resolution image encoders. Through extensive experiments on panoptic narrative grounding, referring expression segmentation, and panoptic segmentation, GELLA achieves state-of-the-art or competitive results while reducing computational cost and maintaining flexibility to absorb outputs from diverse segmentation models. The approach demonstrates that leveraging offline masks and lightweight encoders, guided by an LLM, can effectively ground long captions in complex scenes with practical implications for interactive AI systems and multimodal analysis.

Abstract

In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
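The colormap idea in the abstract can be illustrated with a small sketch: several entity masks are painted into a single RGB image, one color per entity, so that one image-like input carries all mask boundaries. The palette construction and the function names `masks_to_colormap` and `colormap_to_masks` are assumptions made for illustration; the abstract does not specify how colors are assigned.

```python
import numpy as np

def masks_to_colormap(masks):
    """Paint N non-overlapping binary masks of shape (N, H, W) into one RGB image.

    Returns the (H, W, 3) uint8 colormap and the (N, 3) palette used.
    Color (0, 0, 0) is reserved for pixels that belong to no entity.
    """
    n, h, w = masks.shape
    # Hypothetical palette: spread entity ids over the 24-bit color space
    # with a prime stride so each entity gets a distinct, non-black color.
    ids = (np.arange(1, n + 1) * 9973) % (256 ** 3 - 1) + 1
    palette = np.stack(
        [(ids >> 16) & 255, (ids >> 8) & 255, ids & 255], axis=1
    ).astype(np.uint8)
    colormap = np.zeros((h, w, 3), dtype=np.uint8)
    for k in range(n):
        colormap[masks[k].astype(bool)] = palette[k]
    return colormap, palette

def colormap_to_masks(colormap, palette):
    """Recover the (N, H, W) boolean masks by exact color matching."""
    return np.all(colormap[None] == palette[:, None, None, :], axis=-1)

# Toy example: two entities on a 4x4 canvas, with some unlabeled pixels.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :] = True
masks[1, 2:, :2] = True
colormap, palette = masks_to_colormap(masks)
recovered = colormap_to_masks(colormap, palette)
```

Because each entity keeps a unique color, the masks can be recovered exactly from the colormap at full resolution, which is the property that lets the image itself be encoded at low resolution.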

Paper Structure (27 sections, 10 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Top: Training data of GELLA. The training data includes the image, panoptic segmentation, and panoptic narrative grounding from the COCO dataset. Bottom: Inference pipeline of GELLA. During inference, we can use any LMM (e.g., GPT-4) to generate the caption and any class-agnostic segmentation model (e.g., EntitySeg) to generate the entity segmentation. The GELLA model can associate the two outputs and produce panoptic narrative grounding results.
  • Figure 2: Left: Overview of the GELLA framework. Given an image $\mathbf{I}$ and its corresponding caption $T$, we aim to generate a panoptic segmentation that densely grounds the semantic nouns in the caption. We first obtain an entity-level mask colormap $\mathbf{M_c}$ using a class-agnostic segmentation model. The mask and image are encoded by a mask encoder $f_{\Upsilon}$ and a CLIP vision encoder $f_{\Omega}$, respectively. The two extracted features are then fused by the ResoBlend module and fed into a mask decoder to reconstruct the mask. For the language part, we prepend an instruction to the caption $T$ for extracting semantic nouns and proceed with the tokenizer. The visual tokens $\mathbf{H_v}$ and textual tokens $\mathbf{H_t}$ are concatenated and fed into the language decoder $f_{\phi}$ to generate $<$SEG$>$ tokens as features of each semantic noun. The association module then computes the similarity between the embeddings of semantic nouns and visual entities. Right: Illustration of the ResoBlend Module.
  • Figure 3: Illustration of our user interface. The GELLA framework can perform three tasks: image description, semantic noun extraction, and narrative grounding. We provide more visualization results in our appendix.
  • Figure 4: Sample results of panoptic narrative grounding. The left part shows the input image and the long caption generated by our GELLA model, while the right part displays the panoptic narrative grounding results.
  • Figure 5: Sample results of panoptic narrative grounding. The left part shows the input image and the long caption generated by our GELLA model, while the right part displays the panoptic narrative grounding results.
  • ...and 1 more figure
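
The Figure 2 caption states that the association module computes the similarity between embeddings of semantic nouns and visual entities. A minimal sketch of that step follows, assuming cosine similarity and a simple best-match rule; the function name `associate`, the normalization, and the argmax assignment are illustrative assumptions, not the paper's confirmed design.

```python
import numpy as np

def associate(noun_emb, entity_emb):
    """Match each semantic noun to its most similar visual entity.

    noun_emb:   (K, D) array, one embedding per semantic noun.
    entity_emb: (N, D) array, one embedding per entity mask.
    Returns the (K, N) cosine-similarity matrix and, for each noun,
    the index of the highest-scoring entity.
    """
    noun = noun_emb / np.linalg.norm(noun_emb, axis=1, keepdims=True)
    ent = entity_emb / np.linalg.norm(entity_emb, axis=1, keepdims=True)
    sim = noun @ ent.T  # cosine similarity between every noun and entity
    return sim, sim.argmax(axis=1)

# Toy embeddings (made up for illustration): 2 nouns, 3 entities, D = 3.
noun_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
entity_emb = np.array([[0.1, 0.9, 0.0],
                       [0.8, 0.1, 0.1],
                       [0.0, 0.0, 1.0]])
sim, best = associate(noun_emb, entity_emb)
```

A single argmax assigns exactly one entity per noun. A plural noun that covers several entities would need a different rule, for example a threshold on each row of `sim`, which is likewise an assumption here.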