GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, Joyce Chai

TL;DR

Groundhog advances vision-language grounding by replacing bounding-box grounding with pixel-level holistic segmentation, enabling fine-grained grounding across objects, parts, and text. The approach combines a masked feature extractor with a flexible mask proposal system (Mask2Former+) and a grounding mechanism that ties language to segmentation masks via GRD tokens, producing groundable masks with a transparent scoring process. Training relies on M3G2, a 2.5M-pair grounded instruction-tuning dataset spanning four task types derived from 27 datasets, enabling broad grounding capabilities without task-specific fine-tuning. Empirical results show improved grounding, reduced object hallucination, and strong generalization across tasks, with added benefits in explainability and failure diagnosis, making Groundhog a versatile generalist for grounded vision-language tasks.

Abstract

Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone, which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases.
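The retrieve-and-merge step described above can be illustrated with a small sketch. This is not the authors' implementation: the dot-product scoring, the sigmoid, the 0.5 threshold, the union-style merge, and all function names (`score_entities`, `merge_masks`) are assumptions chosen for illustration. The source states only that the grounding-token hidden states are averaged, used to retrieve entities, and that the retrieved entity masks are merged into one grounding mask.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_entities(grd_hidden_states, entity_features):
    """Score every entity against the averaged grounding-token state.

    Assumption: similarity is a dot product squashed by a sigmoid.
    """
    query = mean_vector(grd_hidden_states)
    return [
        sigmoid(sum(q * f for q, f in zip(query, feature)))
        for feature in entity_features
    ]


def merge_masks(entity_masks, scores, threshold=0.5):
    """Merge the masks of retrieved entities into one binary mask.

    Assumption: an entity is retrieved when its score reaches the
    threshold, and merging is a pixel-wise union.
    """
    height, width = len(entity_masks[0]), len(entity_masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, score in zip(entity_masks, scores):
        if score < threshold:
            continue
        for r in range(height):
            for c in range(width):
                merged[r][c] = max(merged[r][c], mask[r][c])
    return merged


# Toy example: two grounding-token states, three entity proposals on a 2x2 image.
grd_states = [[1.0, 0.0], [3.0, 0.0]]
entity_feats = [[2.0, 0.0], [-2.0, 0.0], [1.0, 5.0]]
entity_masks = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [0, 1]],
]

scores = score_entities(grd_states, entity_feats)
merged = merge_masks(entity_masks, scores)
print([round(s, 3) for s in scores])
print(merged)
```

Because each entity keeps its own score, a wrong grounding can be traced to either a missing mask proposal or a mis-scored entity, which is the kind of failure diagnosis the abstract refers to.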

Paper Structure

This paper contains 42 sections, 1 equation, 21 figures, and 16 tables.

Figures (21)

  • Figure 1: We propose Groundhog, a multimodal large language model that enhances its text output with pixel-level phrase grounding across diverse semantic granularities. The figure demonstrates outputs from our model on the four task types we considered in this work.
  • Figure 2: The model architecture of Groundhog. Given a set of class-agnostic entity mask proposals, the masked feature extractor first extracts the feature of each entity as the visual input of the multi-modal large language model (left). The output hidden states of the grounding tokens are averaged and used to retrieve the entities to ground, which will be merged into a single grounding mask for the phrase. Modules are colored by their trainability: parameter-free operators (grey), frozen (blue), trainable (orange), and partially trainable (mix).
  • Figure 3: Groundhog can take arbitrary spatial prompts that can be resolved by an interactive segmentation model, such as SAM. The placeholder pointer token <PTR> will be replaced by the extracted entity features and fed as input to the model.
  • Figure 4: The M3G2 dataset for grounded visual instruction tuning. M3G2 is a diverse dataset of multiple granularities, unifying 4 different task types with visually grounded dialogue.
  • Figure 5: Examples of Groundhog's performance in grounded image captioning.
  • ...and 16 more figures
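
The masked feature extraction mentioned in the Figure 2 caption can be pictured as pooling an image feature map under each entity mask. The sketch below uses masked average pooling, which is an assumption: the captions above do not specify the pooling operator, and `masked_average_pool` and `extract_entity_features` are hypothetical names. The same routine could in principle produce the entity feature that replaces a `<PTR>` placeholder once a spatial prompt has been resolved into a mask (Figure 3).

```python
def masked_average_pool(feature_map, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    feature_map: H x W x D nested lists of floats.
    mask:        H x W nested lists of 0/1.
    Returns a D-dimensional list; an empty mask yields a zero vector.
    """
    dim = len(feature_map[0][0])
    total = [0.0] * dim
    count = 0
    for feature_row, mask_row in zip(feature_map, mask):
        for feature, keep in zip(feature_row, mask_row):
            if keep:
                count += 1
                for d in range(dim):
                    total[d] += feature[d]
    if count == 0:
        return total
    return [value / count for value in total]


def extract_entity_features(feature_map, masks):
    """One pooled feature vector per entity mask proposal."""
    return [masked_average_pool(feature_map, mask) for mask in masks]


# Toy example: a 2x2 feature map with 2 channels and two mask proposals.
feature_map = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
masks = [
    [[1, 1], [0, 0]],  # top row
    [[0, 0], [0, 1]],  # bottom-right pixel
]

entity_features = extract_entity_features(feature_map, masks)
print(entity_features)
```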