LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng

Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Beyond merely generating visual content, UMs are especially promising for interleaved cross-modal reasoning, e.g., solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge because their visual representations for understanding and generation are disjoint, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.

Paper Structure

This paper contains 37 sections, 10 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Latent-space unified models enable interleaved cross-modal reasoning through shared semantic visual representations. Left: LatentUM improves text-to-image generation via self-reflection over its own generated semantic visual tokens. Middle: LatentUM interleaves textual reasoning with latent visual state updates for visual spatial planning. Right: LatentUM supports world modeling by predicting future visual states as semantic tokens conditioned on actions.
  • Figure 2: Overview of LatentUM. LatentUM unifies visual understanding and generation within a shared semantic latent space, enabling cross-modal reasoning without pixel-space mediation. (a) The visual tokenizer $\mathcal{Q}_\phi$ is trained via model behavior aligned quantization (MBAQ): it minimizes the KL divergence between the VLM's output distributions on the original features $\mathbf{V}$ and on the quantized features $\mathcal{Q}_\phi(\mathbf{V})$, preserving understanding-oriented semantics rather than pixel details. (b) Mixture-of-Modal Experts (MoME): the transformer maintains two parallel branches, where $\psi$ handles understanding and $\theta$ handles generation; the two branches share one self-attention mechanism. Generated visual codes are de-quantized and re-cached in the same context, allowing the model to reason over its own outputs. (c) Decoupled pixel decoder: a diffusion decoder $\boldsymbol{\epsilon}_\eta$ optionally maps quantized semantic features to pixels for visualization; it is trained independently, so the core model never optimizes for pixel fidelity. (Sketches of the MBAQ objective and a MoME block follow this list.)
  • Figure 3: Pixel reconstruction comparison with VQVAE. VQVAE preserves low-level details (e.g., arrow styles) but loses semantic content (e.g., sign text). Our quantizer retains semantics while discarding non-essential pixel details.
  • Figure 4: Training for multi-frame interleaved reasoning. Visual tokens are processed by both MoME branches under a dedicated attention mask, enabling all visual states to be trained in a single forward pass (a sketch of one such mask follows this list).
  • Figure 5: Text-to-image gallery of LatentUM. The last column highlights LatentUM's text rendering capability, which emerges as visual and language tokens share a unified semantic space, enabling legible in-image text.
  • ...and 8 more figures
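
The MBAQ objective described in the Figure 2 caption lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering: `vlm`, `quantizer`, and `mbaq_loss` are illustrative names, and the straight-through/codebook details are assumptions; the caption only specifies minimizing the KL divergence between the VLM's output distributions on $\mathbf{V}$ and on $\mathcal{Q}_\phi(\mathbf{V})$.

```python
import torch
import torch.nn.functional as F

def mbaq_loss(vlm, quantizer, visual_feats, text_tokens):
    """Model behavior aligned quantization (MBAQ), sketched.

    Trains the quantizer Q_phi so the (frozen) VLM behaves the same
    whether it sees the original visual features V or the quantized
    features Q_phi(V). `vlm` and `quantizer` are hypothetical modules.
    """
    # Teacher: VLM output distribution on the original continuous features.
    with torch.no_grad():
        teacher_logits = vlm(visual_feats, text_tokens)

    # Student: the same VLM on the quantized features. Gradients reach the
    # quantizer's parameters phi (in practice via a straight-through
    # estimator, as in standard VQ training).
    quantized_feats, vq_aux = quantizer(visual_feats)
    student_logits = vlm(quantized_feats, text_tokens)

    # KL( p(. | V) || p(. | Q_phi(V)) ) over the vocabulary, batch-averaged.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return kl + vq_aux  # vq_aux: the usual commitment/codebook terms
```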
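Similarly, the MoME design in Figure 2(b), one self-attention shared across modalities with separate feed-forward experts $\psi$ (understanding) and $\theta$ (generation), can be sketched as a transformer block. The pre-norm layout and per-token boolean routing below are assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Mixture-of-Modal Experts block, sketched from the Figure 2(b) caption.

    One self-attention is shared across modalities; each token is then
    routed to a modality-specific FFN expert: psi for understanding
    tokens, theta for generation tokens. Routing granularity here is
    an assumption.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn_psi = nn.Sequential(   # understanding expert
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_theta = nn.Sequential( # generation expert
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, is_gen_token, attn_mask=None):
        # Shared self-attention over the full interleaved sequence.
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out

        # Route each token to its modality expert.
        h = self.norm_ffn(x)
        out = torch.where(
            is_gen_token.unsqueeze(-1),  # (B, T, 1) boolean routing mask
            self.ffn_theta(h),
            self.ffn_psi(h),
        )
        return x + out
```

Computing both experts densely and selecting with `torch.where` keeps the sketch short; a real implementation would gather tokens per expert to avoid the wasted FLOPs.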
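Finally, for the single-forward-pass multi-frame training in Figure 4, one plausible reading is a block-causal mask over interleaved segments: tokens attend freely within and before their own segment but never to future visual states. This is a guess at the mask's structure, not the paper's exact recipe:

```python
import torch

def interleaved_causal_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Block-causal attention mask for multi-frame interleaved training.

    segment_ids: (T,) non-decreasing integer segment index per token,
    where each segment is a text span or a visual state. Token i may
    attend to token j iff j's segment does not come after i's, so every
    visual state is supervised in one forward pass without seeing the
    future. The paper's actual mask may differ.
    """
    # allowed[i, j] = (segment of j) <= (segment of i)
    allowed = segment_ids.unsqueeze(0) <= segment_ids.unsqueeze(1)  # (T, T)
    # nn.MultiheadAttention expects True where attention is *disallowed*.
    return ~allowed
```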