From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

Mingxiao Li; Fang Qu; Zhanpeng Chen; Na Su; Zhizhou Zhong; Ziyang Chen; Nan Du; Xiaolong Li

TL;DR

This work tackles weak multimodal alignment in autoregressively pre-trained MLLMs by reframing alignment as an information-recovery task and introducing Vision Dynamic Embedding-Guided Pre-training (VDEP). VDEP adds a dynamic image-embedding reconstruction objective to autoregressive training, incorporating image tokens into the text-conditioned learning process without architectural changes. The method uses a hybrid training schedule that balances text-focused and image-focused objectives via a tunable weight $\alpha$ and a 1:1 batch mix, achieving state-of-the-art or competitive results across 13 benchmarks, including notable improvements on RealWorldQA and VizWizQA. The approach yields clearer visual representations and fewer hallucinations, which supports more robust cross-modal reasoning in MLLMs. Future work aims to adjust $\alpha$ adaptively and to further improve data efficiency.

Abstract

While multimodal large language models (MLLMs) perform well on perceptual tasks, they lack precise multimodal alignment, which limits their performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Using dynamic embeddings from the MLP that follows the visual encoder, VDEP supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs focus primarily on recovering information from textual inputs and often neglect the effective processing of image data. In contrast, the key contribution of this work is to reinterpret multimodal alignment as a process of recovering information from input data, with particular emphasis on reconstructing detailed visual features. The proposed method integrates into standard models without architectural changes. Experiments on 13 benchmarks show that VDEP outperforms baselines and existing methods.
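
To make the idea of a hybrid objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the combined loss is a text cross-entropy term plus $\alpha$ times an image reconstruction term, that the reconstruction distance is mean squared error, and that the hidden state at image position t is supervised by the projector (MLP) embedding at position t+1. None of these specifics are stated in the abstract, and all function names are hypothetical.

```python
import numpy as np

def text_cross_entropy(logits, targets):
    """Mean next-token cross-entropy over text positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def image_reconstruction_loss(image_hidden, image_embeds):
    """MSE between hidden states and the next image embedding.

    Assumption: position t predicts the projector embedding at t+1,
    mirroring next-token prediction; the distance (MSE) is also assumed.
    """
    return float(np.mean((image_hidden[:-1] - image_embeds[1:]) ** 2))

def hybrid_loss(text_logits, text_targets, image_hidden, image_embeds, alpha):
    """Assumed combination: text loss + alpha * image loss."""
    text_term = text_cross_entropy(text_logits, text_targets)
    image_term = image_reconstruction_loss(image_hidden, image_embeds)
    return text_term + alpha * image_term

# Toy usage with random values standing in for model outputs.
rng = np.random.default_rng(0)
text_logits = rng.normal(size=(6, 32))      # 6 text positions, vocabulary of 32
text_targets = rng.integers(0, 32, size=6)  # next-token ids
image_embeds = rng.normal(size=(9, 16))     # 9 image tokens from the projector
image_hidden = rng.normal(size=(9, 16))     # LLM hidden states at image positions
loss = hybrid_loss(text_logits, text_targets, image_hidden, image_embeds, alpha=0.5)
```

In this sketch, setting `alpha` to 0 recovers a purely text-focused objective, while larger values put more weight on reconstructing the visual embeddings. The 1:1 batch mix mentioned in the TL;DR is not modeled here.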
