CoLLaVO: Crayon Large Language and Vision mOdel
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
TL;DR
CoLLaVO asks whether current vision-language models (VLMs) have robust object-level image understanding and shows that this capability correlates strongly with zero-shot vision-language (VL) performance. It introduces Crayon Prompt, a visual prompt derived from panoptic color maps, and a Dual QLoRA training scheme that combines object-level instruction tuning with visual instruction tuning while preserving the object-level understanding already learned. Empirical results show state-of-the-art zero-shot VL performance for a 7B-scale model and highlight the critical role of the semantic and numbering embeddings, Crayon Prompt Tuning (CPT), and Crayon Prompt-based Instruction Tuning (CIT) in grounding. The work suggests object-level grounding as a scalable pathway to robust cross-modal generalization and reduced hallucination in vision-language systems.
Abstract
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet it remains unexplored whether current VLMs genuinely possess sound object-level image understanding, as assessed by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a Dual QLoRA learning strategy that preserves object-level image understanding rather than forgetting it during visual instruction tuning, thereby achieving a significant leap on numerous VL benchmarks in a zero-shot setting.
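As a rough illustration of the idea of a prompt built from a panoptic color map, the sketch below assigns each image patch a class id and an instance number and looks up one learnable vector for each. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_crayon_prompt`, the table sizes, the patch-grid layout, and the choice to add the prompt to the patch features are all hypothetical, and random values stand in for learned embeddings.

```python
import numpy as np


def build_crayon_prompt(class_ids, instance_ids, semantic_table, numbering_table):
    """Return a (H, W, D) prompt from per-patch class ids and instance numbers.

    Hypothetical helper: each patch receives the sum of a semantic embedding
    (what the object is) and a numbering embedding (which instance it is).
    """
    return semantic_table[class_ids] + numbering_table[instance_ids]


rng = np.random.default_rng(0)
num_classes, max_instances, dim = 4, 3, 8  # assumed sizes, for illustration only

# Stand-ins for learnable embedding tables.
semantic_table = rng.normal(size=(num_classes, dim))
numbering_table = rng.normal(size=(max_instances, dim))

# A toy 2x3 patch grid, as a panoptic map would provide after downsampling.
class_ids = np.array([[0, 0, 1],
                      [2, 1, 1]])
instance_ids = np.array([[0, 0, 0],
                         [0, 0, 1]])

patch_features = rng.normal(size=(2, 3, dim))  # stand-in for vision encoder output

crayon_prompt = build_crayon_prompt(class_ids, instance_ids,
                                    semantic_table, numbering_table)

# Assumption: the prompt is injected by element-wise addition to patch features.
prompted_features = patch_features + crayon_prompt
```

In this toy grid, two patches of the same class and instance share one prompt vector, while two patches of the same class but different instances differ only through the numbering embedding, which is the distinction a panoptic map is meant to carry.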
