Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai

TL;DR

This work addresses learning robust visual representations from interleaved image-text data, a prevalent but underutilized form of web data. It introduces Latent Compression Learning (LCL), which treats pre-training as latent compression: it maximizes the mutual information $I(y; z)$ between the outputs $y$ of a causal attention model and its latent inputs $z$. This objective decomposes into a contrastive term that aligns each visual latent with its preceding context and an auto-regressive term that generates the subsequent text. A Vision Transformer encoder maps images to latent tokens, which are interleaved with text embeddings and processed by a causal language model. The training loss is $L = \lambda L_{con} + L_{gen}$, where $L_{con}$ is the contrastive loss, $L_{gen}$ is the next-token generation loss, and $\lambda$ balances the two, so the vision encoder can be pre-trained from scratch on interleaved data. Experiments on LAION-400M, MMC4, and OBELICS show that LCL matches CLIP on paired data and can also exploit interleaved data, supporting a compression-based view of vision-language pre-training. These results suggest that interleaved web data can yield robust visual representations without heavy reliance on paired datasets, which matters for scalable multi-modal learning.
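As a rough illustration of how the combined objective $L = \lambda L_{con} + L_{gen}$ could be computed, the NumPy sketch below pairs an InfoNCE-style contrastive term with a cross-entropy next-token term. It is a minimal sketch under assumptions, not the authors' implementation: the function names, the cosine-similarity scoring, the temperature of 0.07, and the use of in-batch negatives are all illustrative choices that the summary above does not specify.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(context_out, image_latents, temperature=0.07):
    """InfoNCE-style loss (assumed form): row i of context_out is the
    language-model output summarising the context that precedes image i,
    and should score highest against row i of image_latents."""
    c = context_out / np.linalg.norm(context_out, axis=1, keepdims=True)
    z = image_latents / np.linalg.norm(image_latents, axis=1, keepdims=True)
    logits = c @ z.T / temperature            # (batch, batch) similarities
    log_probs = log_softmax(logits, axis=1)   # other rows act as negatives
    return -np.mean(np.diag(log_probs))


def generation_loss(token_logits, target_ids):
    """Mean cross-entropy of the next-token predictions.
    token_logits: (num_tokens, vocab_size); target_ids: (num_tokens,)."""
    log_probs = log_softmax(token_logits, axis=-1)
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()


def lcl_loss(context_out, image_latents, token_logits, target_ids, lam=1.0):
    """L = lam * L_con + L_gen; lam is the balancing weight (hypothetical default)."""
    l_con = contrastive_loss(context_out, image_latents)
    l_gen = generation_loss(token_logits, target_ids)
    return lam * l_con + l_gen
```

In this sketch the contrastive term only sees the context before each image and the generation term only scores the text after it, which mirrors the two roles described above. In a real training loop both terms would be backpropagated into the vision encoder.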

Abstract

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representation from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. Code is released at https://github.com/OpenGVLab/LCL.

Paper Structure (18 sections, 11 equations, 3 figures, 11 tables)


Figures (3)

  • Figure 1: Comparison of different training frameworks. (a) The contrastive learning framework from CLIP (Radford et al., 2021) pre-trains vision encoders from scratch with image-text pairs, but it does not support interleaved data. (b) Our proposed LCL pre-training framework can effectively pre-train vision encoders from scratch with interleaved image-text data. In these two frameworks, the text encoder or the language model that provides supervision can optionally be discarded during the transfer stage. (c) The multi-modal incremental training process uses interleaved image-text data to align a pre-trained vision encoder with the language model, but it cannot pre-train vision encoders from scratch.
  • Figure 2: Overview of our proposed Latent Compression Learning for vision model pre-training. Image latent representation is extracted via a vision encoder and subsequently input into a language model alongside textual embedding. Two complementary losses are utilized to learn robust visual representation from scratch on interleaved image-text data: a contrastive loss ensures consistency between the visual latent representation and its preceding context, while an auto-regressive loss enhances the predictability of visual representation for subsequent text.
  • Figure 3: Illustration of "frozen transfer" evaluation. The vision encoder is frozen during transfer tuning. (a) Image classification: an attention probe and a linear classifier are built upon the vision encoder. (b) Image-text retrieval: an attention probe extracts a global visual feature, which is trained to align with the text feature from the text encoder. (c) Text generation: an MLP aligns the visual feature with the text embedding space, and the multi-modal embedding is fed into the language model for auto-regressive text generation.
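
The frozen-transfer classification setup in Figure 3(a) can be sketched as a small trainable head on top of fixed encoder outputs. The NumPy forward pass below is a minimal sketch under assumptions: the `AttentionProbe` class, its single-head design, the learnable query vector, and the initialisation scale are hypothetical, since the caption does not describe the probe's internals. The patch tokens are treated as constants, standing in for the frozen vision encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionProbe:
    """Single-head attention pooling followed by a linear classifier.
    Only these parameters would be trained; the encoder stays frozen."""

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.query = rng.normal(scale=0.02, size=(dim,))      # learnable query
        self.w_k = rng.normal(scale=0.02, size=(dim, dim))    # key projection
        self.w_v = rng.normal(scale=0.02, size=(dim, dim))    # value projection
        self.w_cls = rng.normal(scale=0.02, size=(dim, num_classes))
        self.b_cls = np.zeros(num_classes)

    def pool(self, tokens):
        """tokens: (batch, num_patches, dim) frozen encoder outputs.
        Returns one global feature per image, shape (batch, dim)."""
        keys = tokens @ self.w_k
        values = tokens @ self.w_v
        scores = keys @ self.query / np.sqrt(tokens.shape[-1])  # (batch, num_patches)
        weights = softmax(scores, axis=-1)                      # attention over patches
        return np.einsum("bn,bnd->bd", weights, values)

    def logits(self, tokens):
        """Class logits from the pooled global feature."""
        return self.pool(tokens) @ self.w_cls + self.b_cls
```

The same pooled feature could serve the retrieval setting in Figure 3(b) by replacing the linear classifier with a similarity score against text features, though that variant is not shown here.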