CoLLaVO: Crayon Large Language and Vision mOdel

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

TL;DR

CoLLaVO asks whether current vision-language models (VLMs) have robust object-level image understanding, and shows that this capability correlates strongly with zero-shot vision-language (VL) performance. It introduces the Crayon Prompt, a visual prompt derived from panoptic color maps, and a Dual QLoRA training scheme that combines object-level instruction tuning with visual instruction tuning without forgetting the object-level understanding already learned. Empirical results show state-of-the-art zero-shot VL performance for a 7B-scale model and highlight the role of the semantic and numbering embeddings, Crayon Prompt Tuning (CPT), and Crayon Prompt-based Instruction Tuning (CIT) in grounding. The work suggests object-level grounding as a scalable route to robust cross-modal generalization and reduced hallucination in vision-language systems.

Abstract

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
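
To make the idea of a prompt built from a panoptic color map more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that every location of a panoptic map carries a class id and an instance number, and that the prompt at that location is the sum of a semantic embedding and a numbering embedding. The function names, the tiny 2x3 map, the embedding size, and the random tables standing in for learnable parameters are all hypothetical.

```python
import random


def make_embedding_table(num_entries, dim, seed):
    """Stand-in for a learnable embedding table (random values, assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_entries)]


def build_crayon_prompt(class_map, number_map, semantic_table, numbering_table):
    """Return an H x W x D grid: semantic + numbering embedding per location.

    Assumption: the two embeddings are combined by element-wise addition.
    """
    if len(class_map) != len(number_map):
        raise ValueError("class_map and number_map must have the same height")
    prompt = []
    for class_row, number_row in zip(class_map, number_map):
        if len(class_row) != len(number_row):
            raise ValueError("class_map and number_map must have the same width")
        prompt.append([
            [s + n for s, n in zip(semantic_table[c], numbering_table[k])]
            for c, k in zip(class_row, number_row)
        ])
    return prompt


# Hypothetical 2x3 panoptic map.
# Class ids: 0 = background, 1 = "dog".
# Instance numbers: 0 = none, 1 = first instance, 2 = second instance.
class_map = [[0, 1, 1],
             [0, 1, 1]]
number_map = [[0, 1, 2],
              [0, 1, 2]]

semantic_table = make_embedding_table(num_entries=2, dim=4, seed=0)
numbering_table = make_embedding_table(num_entries=3, dim=4, seed=1)

crayon_prompt = build_crayon_prompt(
    class_map, number_map, semantic_table, numbering_table
)
```

In this sketch, two locations that belong to the same class but to different instances share the semantic term and differ only in the numbering term, which is the property that lets a prompt distinguish "dog 1" from "dog 2".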

Paper Structure (21 sections, 8 figures, 6 tables)

This paper contains 21 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Zero-shot performance of CoLLaVO-7B on challenging VL datasets compared with closed-source VLMs [gptsyscard, gpttechnical, team2023gemini, bai2023qwen]. Note: The MME scores are rescaled by $1/20$ to match the scale of the accuracies on the other datasets.
  • Figure 2: Asking four baselines (BLIP2, InstructBLIP, Qwen-VL, and LLaVA1.5) two types of questions, Class2Binary (C2B) and Box2Class (B2C), and measuring their accuracies on each object category.
  • Figure 3: Regressed relationships between (a) C2B and B2C for each object category, (b) the average of C2B & B2C and zero-shot GQA [hudson2019gqa] performance for each object category, and (c) the average of C2B & B2C and zero-shot TextVQA [singh2019towards] performance for each object category, plotted to visualize their correlations. The light-colored areas indicate the 95% confidence interval of each regression.
  • Figure 4: Overview of two-step training for CoLLaVO. Note that 'Vision' denotes the vision encoder, and that the fire symbols mark the modules being trained.
  • Figure 5: How the Crayon Prompt is generated from a panoptic color map with learnable semantic queries and numbering queries. In addition, crayon instruction examples are given, which are used to conduct CPT and CIT. Note that '{}' denotes a placeholder where information is filled in adaptively.
  • ...and 3 more figures
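
The per-category relationship that Figure 3 visualizes can be illustrated with an ordinary least-squares line and a Pearson correlation coefficient. The following is a minimal sketch under stated assumptions: the function name `pearson_and_fit` is hypothetical, the accuracy values are invented placeholders rather than numbers from the paper, and the paper's own regression and confidence-interval procedure may differ.

```python
from math import sqrt


def pearson_and_fit(xs, ys):
    """Return (r, slope, intercept) for a least-squares line y = slope*x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return r, slope, intercept


# Placeholder per-category accuracies (NOT values from the paper).
c2b_accuracy = [0.62, 0.70, 0.75, 0.81, 0.90]
b2c_accuracy = [0.30, 0.41, 0.44, 0.55, 0.63]

r, slope, intercept = pearson_and_fit(c2b_accuracy, b2c_accuracy)
print(f"Pearson r = {r:.3f}, fit: B2C = {slope:.3f} * C2B + {intercept:.3f}")
```

A value of r close to 1 on real per-category data would correspond to the strong positive correlation the figure caption describes.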