One Diffusion to Generate Them All

Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu

TL;DR

OneDiffusion presents a universal diffusion framework that treats all conditioning and target images as a sequence of views with varying noise levels, enabling bidirectional generation and understanding across text-to-image, image-to-image, ID customization, and multiview tasks. Built on a flow-matching objective and a Next-DiT transformer architecture, it trains from scratch on a large, heterogeneous One-Gen dataset and supports different resolutions, including high-resolution 1024 × 1024 outputs, without task-specific modules. The approach demonstrates competitive performance on generation and predictive tasks (depth, pose, segmentation) and generalizes to zero-shot task composition, multi-view generation, and personalization. Together, these results push toward a general-purpose vision model that can serve as a flexible backbone for a wide range of applications.

Abstract

We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion.
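
The idea of treating a task as a frame sequence with per-frame noise scales can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `noise_views`, the uniform timestep distribution, and the linear interpolation path (with $t = 0$ meaning a clean frame and $t = 1$ pure Gaussian noise) are assumptions chosen for illustration, picked to be consistent with the Figure 2 caption, where condition timesteps are set to $0$.

```python
import numpy as np

def noise_views(views, rng, cond_mask=None, t=None):
    """Noise each view of a sample at its own timestep (illustrative sketch).

    views:     (N, H, W, C) array of clean views/frames.
    cond_mask: optional boolean (N,) array; True marks a conditioning view,
               whose timestep is pinned to 0 so it stays clean.
    t:         optional (N,) array of timesteps in [0, 1]; sampled uniformly
               and independently per view when omitted (an assumption).
    """
    n = views.shape[0]
    if t is None:
        t = rng.uniform(0.0, 1.0, size=n)      # one timestep per view
    t = np.asarray(t, dtype=float)
    if cond_mask is not None:
        t = np.where(cond_mask, 0.0, t)        # conditions keep t = 0
    eps = rng.standard_normal(views.shape)     # Gaussian noise per view
    t_b = t.reshape(n, 1, 1, 1)                # broadcast over H, W, C
    noisy = (1.0 - t_b) * views + t_b * eps    # assumed linear path
    return noisy, t, eps

rng = np.random.default_rng(0)
views = np.ones((3, 4, 4, 3))

# Training-style call: every view gets an independent timestep.
train_noisy, train_t, _ = noise_views(views, rng)

# Inference-style call: view 0 is the condition (t = 0, clean),
# the remaining views start from pure noise (t = 1).
cond = np.array([True, False, False])
infer_noisy, infer_t, infer_eps = noise_views(views, rng, cond_mask=cond, t=np.ones(3))
```

Under these assumptions, the same routine covers both regimes: which frames count as conditions is decided only by which timesteps are pinned to zero, not by any task-specific module.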

Paper Structure

This paper contains 42 sections, 6 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: OneDiffusion is a unified diffusion model designed for both image synthesis and understanding across diverse tasks. It supports text-to-image generation (red box), conditional image generation from input images (orange box), and its reverse task, image understanding (green box). It can also perform ID customization (blue box) and multi-view generation (purple box) with an arbitrary number of input and output images.
  • Figure 2: Illustration of the training and inference pipeline for OneDiffusion. We encode the desired task for each sample via a special task token. During training, we independently sample a different diffusion timestep for each view and add noise to it accordingly. At inference, we replace input image(s) with Gaussian noise while setting the timesteps of the conditions to $0$.
  • Figure 3: High-resolution text-to-image samples from our OneDiffusion model, showcasing precise prompt adherence, attention to fine details, and high image quality across a wide variety of styles.
  • Figure 4: Illustration of our model's capability to generate HED edges, depth, human pose, semantic masks, and bounding boxes from an input image. For semantic segmentation, we segment the sword (highlighted in yellow) and the moon (highlighted in cyan) in the first example, and the road (yellow) and sky (cyan) in the second. For object detection, we localize the head and the moon (both highlighted in cyan). Leveraging these conditions, we can reverse the process to recreate a variant of the input image based on the same caption. Additionally, we can edit the image by modifying specific elements, such as replacing the moon with Saturn (last example).
  • Figure 5: Illustration of multiview generation from a single input image. For the left scenes, we slice the azimuth range $[-45, 60]$ and the elevation range $[-15, 45]$ into equal steps. For the right scene, the azimuth range is set to $[0, 360]$ and the elevation range to $[-15, 15]$.
  • ...and 14 more figures
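
The equal slicing of azimuth and elevation described in the Figure 5 caption can be sketched as follows. This is only an illustration: the helper name `camera_sweep`, the number of views, the pairing of azimuth and elevation along a single trajectory (rather than a full grid), and the use of degrees are all assumptions, since the caption does not specify them.

```python
import numpy as np

def camera_sweep(az_range, el_range, num_views):
    """Return num_views (azimuth, elevation) pairs, evenly spaced over each range.

    Assumes both angles advance together along one trajectory; the caption
    does not say whether the views form a trajectory or a grid.
    """
    az = np.linspace(az_range[0], az_range[1], num_views)
    el = np.linspace(el_range[0], el_range[1], num_views)
    return list(zip(az.tolist(), el.tolist()))

# Ranges quoted in the caption for the left scenes; 4 views is hypothetical.
left_poses = camera_sweep((-45.0, 60.0), (-15.0, 45.0), num_views=4)

# Ranges quoted for the right scene; 8 views is hypothetical.
right_poses = camera_sweep((0.0, 360.0), (-15.0, 15.0), num_views=8)
```

For the full-circle case, azimuths 0 and 360 describe the same direction, so a real pipeline might exclude one endpoint; the sketch keeps both for simplicity.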