Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Yongshuo Zong; Qin Zhang; Dongsheng An; Zhihua Li; Xiang Xu; Linghan Xu; Zhuowen Tu; Yifan Xing; Onkar Dabeer

TL;DR

Ground-V introduces a data-centric approach for teaching vision–language models to ground complex instructions at the pixel level. By automatically generating 500K instruction–segmentation pairs through a teacher–student workflow with minimal human validation, Ground-V scales rich referring expressions across five real-world challenges. Fine-tuning LISA and PSALM on Ground-V yields state-of-the-art performance on RefCOCO/+/g and gRefCOCO (gIoU up to 70.6% and N-Acc up to 83.7% on gRefCOCO), along with notable gains on a challenging Ground-V test set and improved robustness to language priors. The results demonstrate the practical impact of scalable, richly annotated grounding data on pixel-level localization under complex instructions, and they point to two avenues for future work: reducing data noise and extending the benefits to generalist VLMs.

Abstract

This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training yields an average gIoU improvement of 4.4% for LISA and 7.9% for PSALM across six benchmarks. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state of the art by more than 20%.
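The abstract reports gIoU and N-Acc without defining them. The following is a minimal sketch of how these two metrics are commonly computed in generalized referring segmentation, not the authors' evaluation code. It assumes that gIoU is the mean of per-sample mask IoUs, that a no-target sample (empty ground-truth mask) scores 1 when the prediction is also empty and 0 otherwise, and that N-Acc is the fraction of no-target samples predicted as empty. The function names `mask_iou`, `giou`, and `n_acc` are hypothetical.

```python
from typing import Sequence

# A binary mask is represented as rows of 0/1 values (an assumption for this sketch).
Mask = Sequence[Sequence[int]]


def is_empty(mask: Mask) -> bool:
    """Return True when the mask contains no foreground pixel."""
    return not any(any(row) for row in mask)


def mask_iou(pred: Mask, target: Mask) -> float:
    """IoU of two same-sized binary masks.

    Assumed convention: if both masks are empty, the IoU is 1.0, so a
    correctly rejected no-target sample counts as a perfect prediction.
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return 1.0 if union == 0 else inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-sample IoUs (assumed definition of gIoU)."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


def n_acc(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Fraction of no-target samples whose prediction is also empty."""
    no_target_preds = [p for p, t in zip(preds, targets) if is_empty(t)]
    if not no_target_preds:
        raise ValueError("no no-target samples to evaluate")
    correct = sum(1 for p in no_target_preds if is_empty(p))
    return correct / len(no_target_preds)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]
    print(giou(preds, targets), n_acc(preds, targets))
```

Under these assumptions, a model that hallucinates a mask for a non-existent object is penalized twice: the sample contributes an IoU of 0 to gIoU, and it counts as a miss for N-Acc.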
