Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, Hanwang Zhang

TL;DR

This work introduces Discrete Diffusion Timestep (DDT) tokens to form a recursive visual language for multimodal pretraining, addressing the language-like structure missing in traditional spatial tokens. By encoding images as expanding sequences of discrete tokens derived from the diffusion process and integrating them with an LLM, the authors create DDT-LLaMA, an encoder-free MLLM trained in two stages (pretraining and instruction tuning) to handle both visual understanding and generation. Extensive experiments across text-to-image generation, image editing, and vision-language comprehension demonstrate strong performance, with analyses confirming the recursive, attribute-disentangled nature of DDT tokens and evidence of scaling laws. The approach achieves competitive or state-of-the-art results among MLLMs and diffusion-based specialists, highlighting its potential for unified multimodal AI systems and suggesting avenues for scaling and broader domain coverage.

Abstract

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLMs and diffusion models, the state of the art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages and hence form an impossible language for LLMs to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs. Project Page: https://DDT-LLaMA.github.io/.
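
The recursive, expanding structure described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the integer token IDs, and the linear mapping from timestep to prefix length are all assumptions made for illustration. It only shows the property that the token set available at a later (noisier) timestep extends the set available at an earlier one, so each new token adds to what the previous tokens already cover.

```python
import math
from typing import List, Sequence


def prefix_for_timestep(tokens: Sequence[int], t: int, num_timesteps: int) -> List[int]:
    """Return the token prefix assumed to accompany the noisy image at timestep t.

    Hypothetical schedule: the prefix length grows linearly with t, so noisier
    images (larger t) receive more tokens to compensate for lost attributes.
    """
    if num_timesteps <= 0:
        raise ValueError("num_timesteps must be positive")
    if not 0 <= t <= num_timesteps:
        raise ValueError("t must lie in [0, num_timesteps]")
    prefix_len = math.ceil(len(tokens) * t / num_timesteps)
    return list(tokens[:prefix_len])


def is_recursive(tokens: Sequence[int], num_timesteps: int) -> bool:
    """Check that each timestep's prefix extends the previous timestep's prefix."""
    previous: List[int] = []
    for t in range(num_timesteps + 1):
        current = prefix_for_timestep(tokens, t, num_timesteps)
        if current[: len(previous)] != previous:
            return False
        previous = current
    return True


if __name__ == "__main__":
    toy_tokens = [7, 3, 9, 1]  # made-up discrete token IDs
    for step in range(5):
        print(step, prefix_for_timestep(toy_tokens, step, num_timesteps=4))
    print("recursive:", is_recursive(toy_tokens, num_timesteps=4))
```

A raster-scan sequence of spatial tokens has no comparable nesting: a prefix of it corresponds to a partial region of the image rather than to a coarser description of the whole image.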

Paper Structure

This paper contains 47 sections, 2 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Auto-regressive training curves of diffusion timestep tokens (left) and spatial tokens (right) under different degrees of sequence perturbation.
  • Figure 2: Overview of our method. (a) The diffusion timestep tokenizer, which encodes an image into a recursive sequence of discrete tokens. (b) An MLLM architecture that unifies comprehension and generation through next-token prediction.
  • Figure 3: Qualitative results of DDT-LLaMA text-to-image generation.
  • Figure 4: Qualitative comparison with EMU3 on T2I generation. DDT-LLaMA responds better to prompts involving counting or position.
  • Figure 5: Qualitative comparison on image editing.
  • ...and 12 more figures