From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

Mingxiao Li, Fang Qu, Zhanpeng Chen, Na Su, Zhizhou Zhong, Ziyang Chen, Nan Du, Xiaolong Li

TL;DR

This work tackles the weak multimodal alignment in autoregressively pre-trained MLLMs by reframing alignment as an information-recovery task and introducing Vision Dynamic Embedding-Guided Pre-training (VDEP). VDEP adds a dynamic image-embedding reconstruction objective to the autoregressive training, incorporating image tokens into the text-conditioned learning process without architectural changes. The method uses a hybrid training schedule that balances text-focused and image-focused objectives via a tunable weight $\alpha$ and a 1:1 batch mix, achieving state-of-the-art or competitive results across 13 benchmarks, including notable improvements on RealWorldQA and VizWizQA. The approach yields clearer visual representations and reduced hallucinations, demonstrating practical impact for robust, cross-modal reasoning in MLLMs; future work aims to adaptively adjust $\alpha$ and optimize data-efficiency further.

Abstract

While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Utilizing dynamic embeddings from the MLP following the visual encoder, this approach supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs primarily focus on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP outperforms baselines and existing methods.
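
The hybrid objective described above can be sketched in plain Python. The abstract does not give the loss in closed form, so several choices here are assumptions rather than the authors' formulation: mean squared error as the reconstruction loss, a one-position shift so that hidden state i regresses image embedding i+1, and the additive weighting `l_text + alpha * l_image`. All function names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def text_loss(logits_seq, token_ids):
    """Next-token loss: logits at position i are scored against token i+1."""
    pairs = zip(logits_seq[:-1], token_ids[1:])
    losses = [cross_entropy(l, t) for l, t in pairs]
    return sum(losses) / len(losses)

def image_loss(hidden_seq, embed_seq):
    """ASSUMPTION: hidden state i regresses image embedding i+1 under MSE.

    embed_seq stands in for the output of the MLP after the visual encoder,
    which serves as the supervision target instead of ground-truth labels.
    """
    pairs = zip(hidden_seq[:-1], embed_seq[1:])
    losses = [mse(h, e) for h, e in pairs]
    return sum(losses) / len(losses)

def hybrid_loss(l_text, l_image, alpha):
    """ASSUMPTION: additive weighting of the two objectives by alpha."""
    return l_text + alpha * l_image
```

In this sketch the hidden states and the image embeddings must share the same dimension, which holds when the MLP projects visual features into the language model's hidden size.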

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Layer-wise attention visualization of visual-to-instruction information flow. The example is derived from LLava-Bench (Liu et al., 2024b) and the query is "Describe this photo in detail". The visualization results demonstrate that VDEP significantly enhances the model's ability to capture critical features in images, with particularly strong performance in identifying object boundaries.
  • Figure 2: The LLava-VDEP network architecture incorporates two distinct training modes. The VDEP mode performs supervised learning on image data, while the LLava mode is dedicated to supervised learning on text data. During batch training, a ratio parameter is used to control the proportional occurrence of these two modes within each batch, enabling an effective balance in the learning process.
  • Figure 3: Illustration of our VDEP derivation process. (a) Text Pre-training: Convert text into embeddings using tokenization. The LLM generates hidden states, which are processed by the LM head to produce predicted tokens. Compute cross-entropy loss with the original input. (b) Image Pre-training: Divide images into patches. Convert patches into embeddings using visual branches without real labels. These embeddings guide the LLM hidden states to reconstruct image information.
  • Figure 4: Layer-wise attention visualization of visual-to-instruction information flow. Displayed from top to bottom are the attention heatmaps from LLava-v1.5-7B and LLava-v1.5-7B-VDEP, respectively. The example is derived from LLava-Bench (Liu et al., 2024b) and the query is "Describe this photo in detail".
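
The batch mixing described in the Figure 2 caption can be illustrated with a short Python sketch. It assumes that the ratio parameter is the fraction of samples in a batch that are trained in the image-supervised mode, and that samples are assigned to modes individually; the caption does not specify either detail. The function name `assign_modes` and the labels `"vdep"` and `"llava"` are hypothetical.

```python
import random

def assign_modes(batch_size, vdep_ratio=0.5, seed=None):
    """Return a shuffled list of training-mode labels for one batch.

    ASSUMPTION: vdep_ratio is the fraction of samples supervised on image
    data ("vdep"); the remainder are supervised on text data ("llava").
    The default of 0.5 corresponds to a 1:1 mix.
    """
    if not 0.0 <= vdep_ratio <= 1.0:
        raise ValueError("vdep_ratio must be in [0, 1]")
    n_vdep = round(batch_size * vdep_ratio)
    modes = ["vdep"] * n_vdep + ["llava"] * (batch_size - n_vdep)
    # Shuffle so the two modes are interleaved rather than grouped.
    random.Random(seed).shuffle(modes)
    return modes
```

A training loop could then select the image-reconstruction loss or the text cross-entropy loss per sample according to the returned label.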