CoLLaVO: Crayon Large Language and Vision mOdel

Byung-Kwan Lee; Beomchan Park; Chae Won Kim; Yong Man Ro

TL;DR

CoLLaVO asks whether current vision-language models (VLMs) have robust object-level image understanding and shows that this capability correlates strongly with zero-shot vision-language (VL) performance. It introduces the Crayon Prompt, a visual prompt derived from panoptic color maps, and a Dual QLoRA training scheme that combines object-level instruction with visual instruction tuning without forgetting previously learned capabilities. Empirical results show state-of-the-art zero-shot VL performance for a 7B-scale model and highlight the role of the semantic and numbering embeddings, Crayon Prompt Tuning (CPT), and Crayon Prompt-based Instruction Tuning (CIT) in grounding. The work suggests object-level grounding as a scalable path to robust cross-modal generalization and reduced hallucination in vision-language systems.

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
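As a rough illustration of how a prompt "based on panoptic color maps" might be assembled, the sketch below looks up one embedding per pixel from its class id and one from its instance number, then adds them. The names `build_crayon_prompt` and `init_table`, the toy 2x3 map, the embedding size, and the choice to sum the two embeddings are all assumptions made for illustration; the paper's actual construction may differ.

```python
import random


def init_table(rows, dim, seed):
    """Hypothetical stand-in for a learnable embedding table (rows x dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def build_crayon_prompt(class_map, number_map, semantic_emb, numbering_emb):
    """Return an H x W grid of vectors, one per pixel.

    Assumption: each pixel's vector is the sum of the embedding of its
    semantic class and the embedding of its instance number.
    """
    prompt = []
    for class_row, number_row in zip(class_map, number_map):
        row = []
        for class_id, instance_no in zip(class_row, number_row):
            sem = semantic_emb[class_id]
            num = numbering_emb[instance_no]
            row.append([s + n for s, n in zip(sem, num)])
        prompt.append(row)
    return prompt


# Toy panoptic map (made up): class 0 = background, class 1 = an object class.
# The numbering map separates two instances of class 1.
class_map = [[0, 1, 1],
             [0, 1, 1]]
number_map = [[0, 0, 1],
              [0, 0, 1]]

semantic_emb = init_table(rows=2, dim=4, seed=0)
numbering_emb = init_table(rows=2, dim=4, seed=1)
crayon = build_crayon_prompt(class_map, number_map, semantic_emb, numbering_emb)
```

Pixels that share a class and an instance number receive identical vectors, while two instances of the same class differ only through the numbering embedding, which is one plausible way to let a model tell "the first dog" from "the second dog".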

Paper Structure (21 sections, 8 figures, 6 tables)

This paper contains 21 sections, 8 figures, 6 tables.

Introduction
Research Backgrounds
Visual Prompting.
LLMs, VLMs, and Instruction Tuning.
CoLLaVO
Model Architecture and Prompt Protocol.
Crayon Prompt Tuning (CPT).
Crayon Prompt-based Instruction Tuning (CIT).
Experiments
Implementation Details of CoLLaVO.
Object-level Image Understanding.
Zero-shot VL Evaluation.
The effectiveness of Crayon Prompt and CIT.
Discussion and Conclusion
Limitations
...and 6 more sections

Figures (8)

Figure 1: Zero-shot performance of CoLLaVO-7B on challenging VL datasets compared with closed-source VLMs [gptsyscard, gpttechnical, team2023gemini, bai2023qwen]. Note: The MME scores are rescaled by $1/20$ to match the scale of the accuracies on the other datasets.
Figure 2: Asking four baselines (BLIP2, InstructBLIP, Qwen-VL, and LLaVA1.5) two types of questions, Class2Binary (C2B) and Box2Class (B2C), and measuring their accuracies on each object category.
Figure 3: Regressed relationships between (a) C2B and B2C, (b) the average of C2B and B2C and zero-shot GQA [hudson2019gqa] performance, and (c) the average of C2B and B2C and zero-shot TextVQA [singh2019towards] performance, each plotted per object category to visualize their correlations. The light-colored areas indicate the 95% confidence interval.
Figure 4: Overview of two-step training for CoLLaVO. Note that 'Vision' represents vision encoder, and that the fire symbols represent the modules to learn.
Figure 5: Describing how the Crayon Prompt is generated from a panoptic color map with learnable semantic queries and numbering queries. In addition, crayon instruction examples are given, which are used to conduct CPT and CIT. Note that, '{}' denotes the place where we adaptively input information.
...and 3 more figures
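
The per-category analysis described in the captions of Figures 2 and 3 (accuracy on each object category for C2B and B2C, then the relationship between the two) can be sketched roughly as follows. The helper names, the record format, and the toy records are hypothetical and exist only to show the computation; they are not results from the paper, and Pearson correlation is used here as one simple measure of the relationship rather than the paper's exact regression.

```python
from collections import defaultdict
from math import sqrt


def per_category_accuracy(records):
    """records: iterable of (category, is_correct) pairs for one question type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up answers, for illustration only.
c2b_records = [("dog", True), ("dog", True), ("cup", True),
               ("cup", False), ("kite", False), ("kite", False)]
b2c_records = [("dog", True), ("dog", True), ("cup", False),
               ("cup", False), ("kite", False), ("kite", False)]

c2b_acc = per_category_accuracy(c2b_records)
b2c_acc = per_category_accuracy(b2c_records)

# Align the two question types on the same category order before correlating.
categories = sorted(c2b_acc)
r = pearson([c2b_acc[c] for c in categories], [b2c_acc[c] for c in categories])
```

Aligning both accuracy dictionaries on a shared category order matters: the correlation is only meaningful when each pair of values refers to the same object category.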
