UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

Yaqi Zhao, Wang Lin, Zijian Zhang, Miles Yang, Jingyuan Chen, Wentao Zhang, Zhao Zhong, Liefeng Bo

TL;DR

UniCom is a unified framework that harmonizes multimodal understanding and generation via a compressed continuous representation. The authors show empirically that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation.

Abstract

Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via a compressed continuous representation. We empirically demonstrate that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on a VAE.
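To make the two compression axes contrasted in the abstract concrete, the following is a minimal sketch rather than the paper's method: it uses a random linear projection for channel reduction and average pooling over consecutive tokens for spatial downsampling, whereas the paper describes an attention-based compressor. The sizes (`n = 16`, `d = 8`, `factor = 4`) and all function names are hypothetical and chosen only for illustration. The point is that both routes can be set to the same total budget of scalars while keeping very different things: all token positions in one case, all channels in the other.

```python
import random


def reduce_channels(tokens, proj):
    """Channel compression: project each d-dim token to len(proj) dims.

    All n token positions are kept. `proj` is a list of output-channel
    weight vectors, each of length d (a stand-in for a learned projection).
    """
    return [[sum(x * w for x, w in zip(tok, weights)) for weights in proj]
            for tok in tokens]


def downsample_tokens(tokens, factor):
    """Spatial compression: average every `factor` consecutive tokens.

    All d channels are kept. Real spatial downsampling would pool over 2D
    neighbourhoods; a 1D sequence is used here to keep the sketch short.
    """
    if len(tokens) % factor != 0:
        raise ValueError("token count must be divisible by factor")
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        pooled.append([sum(channel) / factor for channel in zip(*group)])
    return pooled


def num_scalars(features):
    """Total number of scalars in an (n, d) feature list."""
    return len(features) * len(features[0])


rng = random.Random(0)
n, d, factor = 16, 8, 4  # hypothetical sizes, not taken from the paper

tokens = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d // factor)]

channel_out = reduce_channels(tokens, proj)      # shape (16, 2)
spatial_out = downsample_tokens(tokens, factor)  # shape (4, 8)

print(len(channel_out), len(channel_out[0]), num_scalars(channel_out))
print(len(spatial_out), len(spatial_out[0]), num_scalars(spatial_out))
```

Both outputs hold 32 scalars, so a comparison between them is made at a matched budget; which one preserves more useful information is the empirical question the abstract addresses.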

Paper Structure (36 sections, 5 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: High-quality samples generated by UniCom. Built on compressed continuous representations, our unified multimodal model demonstrates exceptional capabilities in text-to-image generation, precise image editing, and fine-grained controllable generation.
  • Figure 2: Overview of the proposed framework. For a controlled comparison, both pathways are built upon the same compressed representations and jointly optimized with cross-entropy loss ($\mathcal{L}_{ce}$) and flow matching loss ($\mathcal{L}_{fm}$).
  • Figure 3: Comparison of image editing results, highlighting UniCom's performance in tasks such as image manipulation, object swapping, and color adjustment. See more visualizations in the Appendix.
  • Figure 4: Overview of the proposed diffusion decoder and reconstruction analysis. Given an input image, the vision encoder extracts semantic features, which are then processed by a compressor and a decompressor to condition the diffusion model for image reconstruction. During training, the compression modules and the diffusion backbone are optimized, while the encoder remains frozen. We investigate the bottleneck capacity by varying the number of tokens $n$ and the feature dimension $d$. As illustrated in the bottom row, compressing the dimension $d$ yields superior reconstruction fidelity, whereas reducing the number of tokens $n$ leads to noticeable blurring in fine details.
  • Figure 5: Visual comparison of image reconstruction results. Notably, our method (d64) maintains high fidelity in high-frequency details (e.g., text characters) and preserves facial identity better than these semantic-based baselines, achieving quality comparable to the specialized Flux.1-dev VAE.
  • ...and 11 more figures
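
The Figure 2 caption mentions joint optimization with a cross-entropy loss ($\mathcal{L}_{ce}$) and a flow matching loss ($\mathcal{L}_{fm}$). The sketch below shows one common way such a combined objective can be written. It is an illustration under stated assumptions, not the paper's formulation: the flow matching term uses the widely used linear interpolation between noise and data with a constant-velocity target, and the weight `fm_weight` and all function names are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one token: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def flow_matching_loss(predict_velocity, data, noise, t):
    """Mean squared error between predicted and target velocity at time t.

    Assumed convention (not confirmed by the source): the path is
    x_t = (1 - t) * noise + t * data, so the target velocity is data - noise.
    """
    x_t = [(1.0 - t) * e + t * x for e, x in zip(noise, data)]
    target = [x - e for e, x in zip(noise, data)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def joint_loss(ce, fm, fm_weight=1.0):
    """Weighted sum of the two terms; fm_weight is a hypothetical knob."""
    return ce + fm_weight * fm


# Toy values standing in for a compressed image feature and a noise sample.
data = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.2]


def oracle(x_t, t):
    # A predictor that already knows the true velocity, for checking only.
    return [x - e for e, x in zip(noise, data)]


def zero_model(x_t, t):
    # A predictor that always outputs zero velocity.
    return [0.0 for _ in x_t]


ce = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=0)  # uniform logits
fm_oracle = flow_matching_loss(oracle, data, noise, t=0.3)
fm_zero = flow_matching_loss(zero_model, data, noise, t=0.3)

print(ce, fm_oracle, fm_zero, joint_loss(ce, fm_zero))
```

With uniform logits over four classes the cross-entropy term equals $\ln 4$, and the oracle predictor drives the flow matching term to zero, which gives two quick sanity checks for an implementation of this kind.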