Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Yongshuo Zong, Qin Zhang, Dongsheng An, Zhihua Li, Xiang Xu, Linghan Xu, Zhuowen Tu, Yifan Xing, Onkar Dabeer

TL;DR

Ground-V introduces a data-centric approach to teach vision–language models to ground complex pixel-level instructions. By automatically generating 500K instruction–segmentation pairs through a teacher–student workflow and minimal human validation, Ground-V scales rich referring expressions across five real-world challenges. Fine-tuning LISA and PSALM on Ground-V yields state-of-the-art performance on RefCOCO/+/g and gRefCOCO, with notable gains on a challenging Ground-V test set (gIoU up to 70.6% and N-Acc up to 83.7% on gRefCOCO) and improved robustness to language priors. The results demonstrate the practical impact of scalable, richly annotated grounding data for enhancing pixel-level localization under complex instructions, while highlighting avenues for reducing data noise and extending benefits to generalist VLMs.

Abstract

This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state of the art by more than 20%.
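The abstract reports results in terms of gIoU and N-Acc without defining them. The sketch below illustrates one common reading of these metrics, assuming the gRefCOCO convention: gIoU is the mean of per-sample IoUs, where a sample with no target scores 1 if the prediction is empty and 0 otherwise, and N-Acc is the fraction of no-target samples for which the model correctly predicts an empty mask. The function names and the list-of-rows mask representation are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def is_empty(mask: Mask) -> bool:
    """True if the mask contains no foreground pixel."""
    return not any(any(row) for row in mask)


def mask_iou(pred: Mask, target: Mask) -> float:
    """IoU of two binary masks.

    Two empty masks count as a perfect match (1.0); a non-empty
    prediction against an empty target yields 0.0.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return 1.0 if union == 0 else inter / union


def g_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-sample IoUs over all samples."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


def n_acc(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Share of no-target samples where the prediction is also empty."""
    no_target_preds = [p for p, t in zip(preds, targets) if is_empty(t)]
    if not no_target_preds:
        return float("nan")  # undefined when there are no no-target samples
    correct = sum(1 for p in no_target_preds if is_empty(p))
    return correct / len(no_target_preds)


# Toy example: one ordinary sample and two no-target samples.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]
print(g_iou(preds, targets))  # per-sample IoUs are 0.5, 1.0, 0.0
print(n_acc(preds, targets))  # one of two no-target samples is rejected
```

Under this reading, a model that hallucinates a mask for an absent object is penalized in both metrics, which is why no-target handling matters for the gRefCOCO numbers quoted above.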

Paper Structure

This paper contains 21 sections, 14 figures, 16 tables.

Figures (14)

  • Figure 1: Performance comparison of LISA and PSALM models trained with and without our Ground-V dataset. Incorporating Ground-V consistently enhances both models' performance across benchmarks, achieving an average improvement of 4.4% for LISA and 7.9% for PSALM on the gIoU metric across six benchmarks.
  • Figure 2: Illustration of the diverse scenarios covered in our Ground-V dataset. These include reasoning-based segmentation (top left), multi-granular instructions (center), multi-object and hallucination handling (top right), and part-whole relationships (bottom left). Each example demonstrates how our dataset provides rich, nuanced instructions and corresponding segmentations.
  • Figure 3: (a) Overview of the G5 data generation pipeline. We identify key real-world challenges, including hallucination, multi-object scenarios, reasoning, multi-granularity, and part-level reference. For each dimension, we design few-shot prompts to generate instruction-response pairs. Using these prompts alongside corresponding images, we guide Claude to produce instruction-response pairs. Finally, human annotators validate the evaluation set to ensure accuracy and reliability. (b) An example few-shot prompt for generating hallucination-mitigation data, covering three levels: object-level, attribute-level, and relation-level. (c) A sample of the generated instructions and responses, paired with segmentation masks from the original data source, which supports training and evaluation. For hallucination-mitigation data, however, no segmentation mask is included because the instruction is deliberately designed to mislead the model; a correctly functioning model should not produce a segmentation mask in this scenario.
  • Figure 4: Our model demonstrates improved understanding of object attributes, enhanced reasoning capabilities, more precise segmentation of smaller object parts, and the ability to reject instructions when the target object is not present in the image.
  • Figure 5: Performance of models trained with different proportions of Ground-V. The average performance is calculated by averaging the gIoU of all subsets.
  • ...and 9 more figures