Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun
TL;DR
RegionVLM addresses the limited fine-grained regional understanding of Vision-Language Pre-training (VLP) models by enabling explicit grounding on user-indicated regions without architectural changes. It leverages the Localized Narratives dataset and a simple tokenization of scribbles to condition a Q-Former, guiding a frozen LLM to generate region-specific captions while preserving global image comprehension. The approach delivers interactive dialogue, strong zero-shot regional understanding across RIS, VCR, and VQA tasks, and improved zero-shot captioning, demonstrating a practical path toward region-aware, generalist VL models. The work suggests avenues for further enhancement via instruction tuning and broader regional grounding capabilities.
Abstract
Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models rely heavily on image-text pairs that capture only coarse, global information about an image, which limits their regional understanding ability. In this work, we introduce RegionVLM, a model equipped with explicit regional modeling capabilities that allow it to understand user-indicated image regions. To achieve this, we design a simple yet innovative approach that requires no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only enables an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.
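To make the idea of feeding a user-indicated region to the model as text more concrete, the following is a minimal sketch of one way a scribble could be tokenized. It is an illustration under stated assumptions, not the authors' implementation: the function names `sample_points` and `scribble_to_text`, the number of sampled points, the number of quantization bins, and the bracketed output format are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y), each normalized to [0, 1]


def sample_points(scribble: Sequence[Point], num_points: int) -> List[Point]:
    """Pick `num_points` points spread evenly by index along the scribble."""
    if not scribble:
        raise ValueError("scribble must contain at least one point")
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    if num_points == 1:
        return [scribble[0]]
    last = len(scribble) - 1
    return [scribble[round(i * last / (num_points - 1))] for i in range(num_points)]


def scribble_to_text(
    scribble: Sequence[Point], num_points: int = 4, num_bins: int = 100
) -> str:
    """Quantize sampled scribble points into integer bins and format them as text.

    Assumption for illustration only: the resulting string is what would be
    passed, alongside the image, as the text input that conditions the model.
    """
    tokens = []
    for x, y in sample_points(scribble, num_points):
        # Clamp to [0, 1], then map to an integer bin in [0, num_bins - 1].
        x_bin = min(num_bins - 1, int(max(0.0, min(1.0, x)) * num_bins))
        y_bin = min(num_bins - 1, int(max(0.0, min(1.0, y)) * num_bins))
        tokens.append(f"{x_bin} {y_bin}")
    return "[" + " ".join(tokens) + "]"


if __name__ == "__main__":
    diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
    print(scribble_to_text(diagonal, num_points=3))
```

Because the region is expressed as an ordinary string of integers, a sketch like this needs no new input branch in the model, which is consistent with the claim that no architectural modification is required.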
