UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li

TL;DR

UniModel is a fully vision-native multimodal framework that unifies understanding and generation by representing both text and images in a shared pixel space. It introduces a Unified Diffusion Transformer trained with rectified flow in the VAE latent space, with the direction of the mapping selected by lightweight task embeddings. The core contributions are representation-level unification via painted text, task-level unification through pixel-to-pixel mappings, and model-level unification with a single parameterization. Empirically, UniModel achieves competitive text-to-image synthesis and image-to-text understanding, demonstrates strong cross-modal alignment and cycle-consistent controllability, and suggests a promising path toward general-purpose visual foundation models grounded in shared pixel representations.

Abstract

We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.
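The abstract names rectified flow as the training objective but gives no formula. As a rough illustration only, the sketch below shows the generic rectified-flow recipe: interpolate on a straight line between a noise sample and a data sample, regress the constant velocity between them, and sample by Euler integration. The time convention (t = 0 is noise, t = 1 is data), the function names, and the use of plain Python lists in place of image tensors are all assumptions made for this example; they are not taken from the paper, and the task-embedding conditioning is omitted.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Constant velocity of the straight path; this is the regression target."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-ins for a flattened target image and a Gaussian noise sample (hypothetical values).
rng = random.Random(0)
data = [0.2, 0.8, 0.5]
noise = [rng.gauss(0.0, 1.0) for _ in data]

def oracle_velocity(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return velocity_target(noise, data)

sample = euler_sample(oracle_velocity, noise, steps=10)
```

With the oracle velocity, the loss is zero and the Euler sampler lands on `data` up to floating-point error. A trained network would replace `oracle_velocity` and would additionally receive the conditioning image and a task embedding.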

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures.

Figures (6)

  • Figure 1: Unified visual representation. (a) Given an RGB image, the model outputs its corresponding painted text image. (b) Using the painted text image as a condition, the model generates a realistic RGB image.
  • Figure 2: Unified bidirectional generation. (a) Image-to-text: the model encodes an input image into conditional states and generates its text-painted counterpart. (b) Text-to-image: provided a painted text image, the model synthesizes a realistic RGB image using the same unified architecture.
  • Figure 3: Gallery of text-to-image generation results.
  • Figure 4: Gallery of image-to-text-image generation results.
  • Figure 5: Cycle inference results. Starting from an input image, the model generates a painted text description and then reconstructs an image from the painted text. The reconstructed image preserves the key semantics and visual attributes of the original, demonstrating strong bidirectional consistency in our unified visual representation.
  • ...and 1 more figure
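The cycle inference described in the Figure 5 caption (image, then painted text, then reconstructed image) can be sketched as a two-step loop. The code below is a toy illustration only: `cycle_inference`, `toy_image_to_text`, `toy_text_to_image`, and `mean_abs_error` are hypothetical names, the two toy functions merely stand in for the two directions of the unified model, and the pixel-level error is an assumed proxy, since the caption speaks of preserved semantics rather than any specific metric.

```python
def cycle_inference(image, image_to_text, text_to_image):
    """Run image -> painted text -> reconstructed image and return both outputs."""
    painted_text = image_to_text(image)
    reconstruction = text_to_image(painted_text)
    return painted_text, reconstruction

def mean_abs_error(a, b):
    """Average absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical stand-ins for the two model directions. An "image" is a flat
# list of intensities in [0, 1]; the toy "painted text" is a coarse binary
# summary of it, so the round trip loses detail.
def toy_image_to_text(image):
    return [round(p) for p in image]

def toy_text_to_image(painted_text):
    return [float(v) for v in painted_text]

image = [0.1, 0.9, 0.8, 0.2]
painted_text, reconstruction = cycle_inference(
    image, toy_image_to_text, toy_text_to_image
)
error = mean_abs_error(image, reconstruction)
```

In this toy setup the reconstruction keeps the coarse structure of the input while discarding fine detail, and running the loop again from `reconstruction` yields the same painted text.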