OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong

TL;DR

OneCAT presents a pure decoder-only unified multimodal model that eliminates external vision encoders and tokenizers at inference by employing a modality-specific Mixture-of-Experts and a scale-aware autoregressive generation mechanism. It unifies multimodal understanding, generation, and editing within a single architecture, integrating next-token and next-scale prediction through a scale-aware adapter to achieve fast, high-quality high-resolution outputs. The paper details a three-stage training pipeline (multimodal pretraining, unified mid-training, and unified SFT) and a data setup that emphasizes efficiency and cross-modal alignment. Experimental results show state-of-the-art performance among encoder-free and many unified models across understanding, generation, and editing tasks, along with substantial inference-time advantages over diffusion-based methods.

Abstract

We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as a Vision Transformer (ViT) or a vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
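To make the modality-specific MoE idea concrete, here is a minimal sketch, assuming each token carries a modality tag and is sent to exactly one feed-forward expert chosen by that tag (hard routing, with no learned gate). The names `EXPERTS`, `make_expert`, and `route`, the three tags, and the toy "experts" that merely scale a vector are all hypothetical illustrations, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def make_expert(scale: float) -> Callable[[Vector], Vector]:
    # Stand-in for a feed-forward network: it only scales the input,
    # which makes it easy to see which expert handled a token.
    return lambda x: [scale * v for v in x]


# Hypothetical expert table: one FFN per modality (assumption).
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "text": make_expert(1.0),  # text tokens
    "und": make_expert(2.0),   # visual tokens for understanding
    "gen": make_expert(3.0),   # visual tokens for generation
}


def route(hidden: Sequence[Vector], modality: Sequence[str]) -> List[Vector]:
    """Apply the expert selected by each token's modality tag."""
    if len(hidden) != len(modality):
        raise ValueError("one modality tag is required per token")
    return [EXPERTS[tag](h) for h, tag in zip(hidden, modality)]
```

Because the route is determined by the tag alone, only one expert runs per token, so the per-token compute stays that of a single FFN in this sketch.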

Paper Structure

This paper contains 41 sections, 6 equations, 15 figures, 15 tables.

Figures (15)

  • Figure 1: Showcase of the text-to-image generation abilities of the OneCAT model.
  • Figure 2: Showcase of the image editing abilities of the OneCAT model, including general image editing tasks such as object removal, background adjustment, color adjustment, subject replacement, and style transfer; as well as perceptual tasks including depth estimation, pose estimation, object segmentation, and Canny edge detection.
  • Figure 3: Inference pipeline of OneCAT, a decoder-only autoregressive unified model that seamlessly supports multimodal understanding, image generation and image editing.
  • Figure 4: Multimodal versatile attention mechanism. $T$ denotes the text tokens. $U$ denotes the continuous visual tokens for multimodal understanding or reference image tokens for image editing. $G_i$ denotes the $i$-th scale discrete visual tokens for visual generation.
  • Figure 5: Overview of the training pipeline. In Stage 1, we first prepare a teacher model by training a two-layer MLP to connect InternViT and the Qwen2.5 LLM. This teacher model is then used to perform understanding distillation for the Und. FFN and the Patch Embedding layer. Simultaneously, we perform generation pretraining to optimize the Gen. FFN. All other parameters of the LLM remain frozen to preserve its pretrained language capabilities. In Stages 2 and 3, the entire model is unfrozen to conduct unified mid-training and supervised fine-tuning (SFT), respectively. The VAE component for visual generation is omitted from the figure for clarity.
  • ...and 10 more figures
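
The token types named in the Figure 4 caption ($T$, $U$, $G_i$) can be illustrated with a block-structured attention mask. The caption does not spell out the attention pattern, so the following is a sketch under stated assumptions: text tokens attend causally, the continuous visual tokens $U$ attend to each other bidirectionally, and the tokens of scale $G_i$ attend to everything before them plus all tokens of their own scale. The function name `build_mask` and the segment encoding are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_mask(segments: Sequence[Tuple[str, int]]) -> List[List[bool]]:
    """Build mask[q][k], True when query position q may attend to key k.

    segments: ordered (kind, length) pairs, kind in {"T", "U", "G"}.
    Each "G" segment represents one scale G_i.
    """
    block_of: List[int] = []
    block = 0
    for kind, length in segments:
        if kind == "T":
            # Assumption: text is causal, so every text token is its own block.
            for _ in range(length):
                block_of.append(block)
                block += 1
        elif kind in ("U", "G"):
            # Assumption: tokens inside one U segment or one scale share a
            # block and therefore see each other in both directions.
            block_of.extend([block] * length)
            block += 1
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    n = len(block_of)
    # A query sees every key whose block is not later than its own.
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Two text tokens, two U tokens, then scales of 1 and 2 tokens.
mask = build_mask([("T", 2), ("U", 2), ("G", 1), ("G", 2)])
```

Encoding the pattern as monotonically increasing block ids keeps the rule to a single comparison; a real implementation would more likely build the same mask as a tensor.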