ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs

Yin Xie, Kaicheng Yang, Peirou Liang, Xiang An, Yongle Zhao, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng

TL;DR

ViCToR introduces a visual comprehension stage for pretraining LMMs, centering on a Learnable Visual Token Pool whose tokens are selected by Hungarian matching to replace image tokens. A visual token reconstruction loss plus dense semantic supervision from detailed captions preserves visual detail and strengthens cross-modal alignment, enabling the LLM to better understand visual information. Trained in a three-stage pipeline, ViCToR achieves state-of-the-art results across multiple benchmarks with notable data efficiency, surpassing LLaVA-NeXT-8B. The work highlights the importance of explicit token-level visual reconstruction and a learnable discrete visual vocabulary for robust vision-language modeling.

Abstract

Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR can learn tokens which retain high visual detail, thereby enhancing the large language model's (LLM's) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED$^I$, and RealWorldQA benchmarks, respectively. Code is available at https://github.com/deepglint/Victor.
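
The token-selection step described in the abstract can be illustrated with a small sketch. It is not taken from the ViCToR code base. It assumes that matching is one-to-one (each image token receives a distinct pool token), that the matching cost is the squared Euclidean distance between embeddings, and that the pool is at least as large as the number of image tokens. The names `hungarian`, `sq_dist`, `image_tokens`, and `pool` are hypothetical, and the 2-D vectors are toy values standing in for real embeddings.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    Hungarian algorithm with row/column potentials, O(n^2 * m).
    Requires len(cost) <= len(cost[0]). Returns a list `match`
    where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        # Flip the matching along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            match[p[j] - 1] = j - 1
    return match


# Toy data: 3 image tokens and a pool of 4 learnable tokens (assumed values).
image_tokens = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
pool = [[0.0, 0.0], [0.9, 1.0], [0.3, 0.0], [5.0, 5.0]]

cost = [[sq_dist(t, q) for q in pool] for t in image_tokens]
match = hungarian(cost)                 # pool index chosen for each image token
replaced = [pool[j] for j in match]     # pool tokens that stand in for image tokens
print(match)  # [0, 2, 1]
```

In this toy case the first two image tokens are both nearest to pool entry 0, so a per-token nearest-neighbour lookup would pick the same entry twice. The one-to-one matching instead gives the second token the next-best entry. For realistic sizes, `scipy.optimize.linear_sum_assignment` solves the same assignment problem and would be the more practical choice.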

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Traditional VLMs train language models to recognize visual tokens, while ViCToR instead replaces vision tokens with ones from a visual token pool, helping LLMs better understand and summarize images.
  • Figure 2: The training pipeline of our proposed ViCToR model. In contrast to LLaVA-1.5 [liu2024llava_improved], we introduce an additional pre-training stage that involves visual token reconstruction and dense semantic supervision. This stage is essential for improving visual comprehension.
  • Figure 3: Qualitative comparison of LLaVA-NeXT-8B and ViCToR-7B. Benefiting from our proposed cross-modal comprehension stage, the ViCToR model exhibits enhanced visual comprehension and reasoning capabilities, and it generates richer image descriptions.
  • Figure 4: We select and visualize image regions made up of more than four contiguous local patches whose nearest entry in the visual token pool (VTP) is the same item.
  • Figure 5: Performance comparison of different pre-training data scales and token pool sizes across various benchmark types.
  • ...and 1 more figure