CrystaL: Spontaneous Emergence of Visual Latents in MLLMs

Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, Xiang Li

TL;DR

CrystaL (Crystallized Latent Reasoning) is a single-stage framework with two paths that process intact and corrupted images, respectively. It crystallizes latent representations into task-relevant visual semantics without relying on auxiliary annotations or external modules.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
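
The two-path objective described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function name `crystal_style_loss`, the weights `lam_pred` and `lam_attn`, the use of KL divergence for both alignment terms, and the choice to apply cross entropy only on the intact path are all assumptions made for illustration. The source states only that attention patterns and prediction distributions are aligned across the two paths and that the objective combines a cross entropy loss with an alignment loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target_index])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def crystal_style_loss(logits_int, logits_cor, attn_int, attn_cor,
                       target_index, lam_pred=1.0, lam_attn=1.0):
    """Hypothetical combined loss for one token position.

    logits_int / logits_cor: next-token logits from the intact / corrupted path.
    attn_int / attn_cor: normalized attention weights from each path.
    The KL form and the weights are assumptions, not taken from the paper.
    """
    p_int = softmax(logits_int)
    p_cor = softmax(logits_cor)
    ce = cross_entropy(p_int, target_index)        # task loss (assumed on intact path)
    pred_align = kl_divergence(p_int, p_cor)       # prediction-distribution alignment
    attn_align = kl_divergence(attn_int, attn_cor) # attention-pattern alignment
    return ce + lam_pred * pred_align + lam_attn * attn_align
```

When the two paths produce identical predictions and attention, both alignment terms vanish and the sketch reduces to plain cross entropy; any disagreement between the paths adds a positive penalty.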

Paper Structure (33 sections, 6 equations, 16 figures, 5 tables)

This paper contains 33 sections, 6 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Previous Paradigms vs. Our Paradigm (CrystaL). (a) Supervising visual latent tokens with predefined features from specific models, such as SAM (Kirillov et al., 2023) and DINO (Caron et al., 2021). (b) Modifying the original image to guide the reasoning steps. (c) While other methods rely on auxiliary modules or data to train the visual latent tokens, CrystaL trains them through self-supervision in a single stage.
  • Figure 2: An overview of CrystaL. (a) After corrupting the raw image, we construct two types of visual tokens from the vision encoder. Both $S_{int}$ and $S_{cor}$ (Sec. 3.3) are then fed into the model to compute $\mathbf{P}_{int}$ and $\mathbf{P}_{cor}$. The hidden states in the corrupted forward pass (dashed arrow), however, are taken from the intact forward pass (solid arrow). The objective function combines a cross-entropy loss with an alignment loss; the attention map alignment is illustrated in (b).
  • Figure 3: Detailed illustration of visual latent token copying. In the forward process, the autoregressive hidden states of the visual latent tokens in the corrupted path are copied from the intact path.
  • Figure 4: Comparison of performance and training data size. Our method achieves superior performance while utilizing significantly fewer samples than baselines, demonstrating exceptional data efficiency.
  • Figure 5: Direct Finetune vs. CrystaL. CrystaL outperforms direct finetuning across all tasks.
  • ...and 11 more figures
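
The latent token copying described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: hidden states are represented as plain lists of vectors, and the helper name `copy_latent_states` and the `latent_positions` argument are hypothetical.

```python
def copy_latent_states(hidden_int, hidden_cor, latent_positions):
    """Return the corrupted-path hidden states with the visual latent
    positions overwritten by the intact-path states.

    hidden_int, hidden_cor: lists of per-token vectors (same length).
    latent_positions: indices of the visual latent tokens (assumed known).
    """
    merged = [list(vec) for vec in hidden_cor]   # copy so inputs stay untouched
    for pos in latent_positions:
        merged[pos] = list(hidden_int[pos])      # take the intact-path state
    return merged

# Toy sequence of four tokens with 2-dimensional hidden states;
# positions 1 and 2 play the role of visual latent tokens.
intact = [[0.1, 0.2], [1.0, 1.1], [2.0, 2.1], [0.3, 0.4]]
corrupted = [[0.5, 0.6], [9.0, 9.1], [8.0, 8.1], [0.7, 0.8]]
merged = copy_latent_states(intact, corrupted, latent_positions=[1, 2])
```

Only the latent positions change; all other positions keep the corrupted-path states.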