LaVin-DiT: Large Vision Diffusion Transformer

Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu

TL;DR

LaVin-DiT addresses the need for a scalable, unified foundation model for vision by pairing a spatial-temporal variational autoencoder (ST-VAE) with a joint diffusion transformer (J-DiT) and enabling in-context, task-conditioned generation. It preserves spatial-temporal coherence through 3D RoPE and full-sequence joint attention while operating in a compact latent space, handling more than 20 diverse vision tasks without fine-tuning. Empirical results show strong cross-task performance, faster inference than prior large vision models, and clear gains with increasing model size and task-context length. This work advances a pathway toward generalist, diffusion-based vision models capable of open-world understanding and generation.

Abstract

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models are available.
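The in-context procedure described in the abstract (input-target pairs as task context, a query, and a target that is produced progressively) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: latents are plain lists of floats standing in for ST-VAE outputs, `build_condition`, `sample`, and `toy_denoiser` are hypothetical names, and the denoiser is a placeholder rather than the paper's J-DiT or its actual sampler.

```python
import random
from typing import Callable, List, Sequence, Tuple

Latent = List[float]  # assumption: a flat list stands in for one latent token sequence


def build_condition(context_pairs: Sequence[Tuple[Latent, Latent]], query: Latent) -> Latent:
    """Concatenate (input, target) context pairs, then the query, into one sequence."""
    tokens: Latent = []
    for inp, tgt in context_pairs:
        tokens.extend(inp)
        tokens.extend(tgt)
    tokens.extend(query)
    return tokens


def toy_denoiser(cond: Latent, x: Latent, step: int) -> Latent:
    """Placeholder denoiser (NOT J-DiT): pulls x toward the mean of the condition."""
    mean = sum(cond) / len(cond)
    return [xi + (mean - xi) / step for xi in x]


def sample(
    context_pairs: Sequence[Tuple[Latent, Latent]],
    query: Latent,
    target_len: int,
    denoiser: Callable[[Latent, Latent, int], Latent],
    num_steps: int,
    seed: int = 0,
) -> Latent:
    """Start from Gaussian noise and refine the whole target at every step."""
    rng = random.Random(seed)
    cond = build_condition(context_pairs, query)
    x = [rng.gauss(0.0, 1.0) for _ in range(target_len)]
    for step in range(num_steps, 0, -1):
        # All target tokens are updated together, conditioned on context and query.
        x = denoiser(cond, x, step)
    return x


pairs = [([1.0, 1.0], [3.0, 3.0])]
prediction = sample(pairs, query=[2.0, 2.0], target_len=2, denoiser=toy_denoiser, num_steps=10)
```

The point of the sketch is the data flow: the task is specified only by the context pairs passed at inference time, so switching tasks means changing `pairs`, not the model weights.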

Paper Structure

This paper contains 22 sections, 6 equations, 30 figures, 4 tables, 2 algorithms.

Figures (30)

  • Figure 1: Comparison of autoregressive and diffusion modeling. (a) In autoregressive modeling, visual data is divided into patches and flattened into a one-dimensional sequence. The model then predicts each token sequentially from left to right and top to bottom, which is computationally intensive for high-dimensional visual data. In addition, tokens marked in red and blue illustrate disrupted spatial dependencies, highlighting the difficulty of preserving spatial coherence. (b) In contrast, diffusion modeling denoises all tokens in parallel across $N$ timesteps, significantly improving computational efficiency and preserving essential spatial structures crucial for high-performance vision tasks.
  • Figure 2: Overview of the Large Vision Diffusion Transformer (LaVin-DiT). As shown in panel (a), the model initially compresses input visual data from the pixel space into a latent space, where multiple input-target pairs serve as the task context. A target is perturbed with Gaussian noise through a diffusion process. Guided by the task context and query, the Joint Diffusion Transformer (J-DiT) iteratively denoises this noisy target over $N$ timesteps to recover a clean latent representation. The prediction is then generated via the ST-VAE decoder. Panels (b) and (c) provide architectural details of the ST-VAE and J-DiT, respectively. "Down." and "Up." indicate downsampling and upsampling, respectively. Concatenation is represented by $\odot$.
  • Figure 3: Qualitative results on diverse image and video-based tasks. The first ten rows show image-based tasks, where each row contains a sequence of images interleaved with annotations, followed by a query. The last image is predicted by the model (marked in red). The last four rows show video-based tasks, where each row includes a video sequence with a series of target frames as task context, followed by a query frame. A set of frames in the red box indicates the model’s predictions. Best viewed in color.
  • Figure 4: Training loss curves for LaVin-DiT of varying model sizes. The 3.4B model demonstrates faster convergence, achieving lower training losses than smaller models as training progresses.
  • Figure 5: Performance of LaVin-DiT at varying sizes. Comparison of LaVin-DiT models with different parameter counts on colorization (MSE) and depth estimation (AbsRel). Lower values indicate better performance.
  • ...and 25 more figures
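
The contrast drawn in the Figure 1 caption can be made concrete with a small arithmetic sketch. It assumes a raster-order (left to right, top to bottom) flattening of an H×W patch grid, as the caption describes; the function names and the example grid size and timestep count are illustrative choices, not values from the paper.

```python
from typing import Tuple


def autoregressive_steps(height: int, width: int) -> int:
    """Sequential predictions needed: one per patch token in raster order."""
    return height * width


def diffusion_steps(num_timesteps: int) -> int:
    """Sequential passes needed: one per timestep, each covering all tokens."""
    return num_timesteps


def raster_distance(a: Tuple[int, int], b: Tuple[int, int], width: int) -> int:
    """Distance between two (row, col) patches after flattening to 1D."""
    (r1, c1), (r2, c2) = a, b
    return abs((r1 * width + c1) - (r2 * width + c2))


# Illustrative numbers only: a 32x32 patch grid and N = 20 timesteps.
print(autoregressive_steps(32, 32))             # 1024
print(diffusion_steps(20))                      # 20
print(raster_distance((0, 0), (0, 1), 32))      # 1  (horizontal neighbours stay adjacent)
print(raster_distance((0, 0), (1, 0), 32))      # 32 (vertical neighbours are pushed apart)
```

The last two lines show the disrupted spatial dependency: patches that touch vertically in the image end up a full row width apart in the flattened sequence, while the number of sequential diffusion passes does not grow with the number of tokens.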