Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun

Abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating that unified multimodal modeling can be both effective and efficient. We will release all code and data for future research.

Paper Structure (19 sections, 3 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Cheers Capabilities. (a) Performance on general understanding and generation benchmarks compared with unified multimodal models (UMMs) of similar scale. (b) Generated image samples of Cheers.
  • Figure 2: Architectural comparison between prior UMMs and Cheers. (a) Separated visual spaces for understanding and generation. (b) Single semantic-centric space with limited structural details. (c) Fused feature representation with potential interference. (d) Cheers (Ours): A unified vision tokenizer that integrates structural and semantic features to ensure stable semantic understanding while enhancing generative details.
  • Figure 3: Overview of Cheers, a unified framework for multimodal understanding and image generation. The Unified Vision Tokenizer converts visual inputs into two kinds of tokens: semantic tokens, which the LLM processes jointly with text tokens for understanding tasks, and detail tokens, which provide step-adaptive high-frequency injection into the cascaded flow matching (CFM) Head during generation. During generation, the CFM Head predicts a continuous-time velocity field in the latent space, enabling iterative sampling from Gaussian noise $\mathbf{z}_{0}$ to the terminal latent $\mathbf{z}_{1}$, which is finally decoded by the VAE decoder.
  • Figure 4: Overall training pipeline and the progression of the GenEval score. The curve above illustrates the GenEval score as a function of cumulative training steps.
  • Figure 5: (a) Heatmap of high-frequency injection across generation steps. (b) High-frequency injection intensity at each step. The reported values are aggregated over multiple runs: each run is normalized individually, and the normalized values are then averaged across samples at the same step, so that the result faithfully represents the overall trend.
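
The iterative sampling described in the Figure 3 caption, from Gaussian noise $\mathbf{z}_{0}$ to the terminal latent $\mathbf{z}_{1}$ along a predicted velocity field, can be illustrated with a minimal Euler integrator. This is only a sketch under stated assumptions, not the paper's implementation: `toy_velocity` is a hypothetical stand-in for the CFM Head that assumes a straight-line path toward a known target, and `euler_sample`, the step count, and the 4-dimensional latent are illustrative choices. The gated detail injection and the VAE decoder are omitted.

```python
import random


def toy_velocity(z, t, target):
    """Hypothetical stand-in for the CFM Head (an assumption, not the paper's model).

    For a straight-line path from z_0 to `target`, the velocity that carries the
    current point z at time t to the target by t = 1 is (target - z) / (1 - t).
    """
    return [(zt - zi) / (1.0 - t) for zi, zt in zip(z, target)]


def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler updates."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t takes the values 0, dt, ..., 1 - dt, so 1 - t is never zero
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


rng = random.Random(0)
target = [0.5, -1.0, 2.0, 0.25]             # illustrative terminal latent z_1
z0 = [rng.gauss(0.0, 1.0) for _ in target]  # Gaussian noise z_0
z1 = euler_sample(lambda z, t: toy_velocity(z, t, target), z0, num_steps=8)
```

With this toy field the final Euler step lands on the target up to floating-point rounding, which makes the loop easy to check. A learned velocity field would only approximate the path, so the number of steps would then trade speed against sample quality.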