Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, Ruixuan Li

TL;DR

This work addresses the limitation of multimodal LLMs that reason primarily in text and struggle with visual imagination. It introduces Sketch-in-Latents (SkiLa), a hybrid auto-regressive framework that interleaves textual thinking with latent visual thoughts, trained using a latent visual semantics reconstruction objective. By interleaving discrete text tokens and continuous latent sketches within a unified reasoning process, SkiLa achieves state-of-the-art performance on vision-centric tasks and demonstrates strong generalization to diverse multi-modal benchmarks, while avoiding external tools or pixel-level image generation during inference. The approach highlights a path toward intrinsically unified visual-text reasoning in MLLMs and opens avenues for task-adaptive, multi-step reasoning patterns.

Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Current works either rely on predefined external toolkits or generate images during thinking. Humans, in contrast, form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that they construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, so that, ideally, all visual imagination processes can be encoded by latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between a textual thinking mode, which generates textual think tokens, and a visual sketching mode, which generates latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure that these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Code will be released at https://github.com/TungChintao/SkiLa.
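
The alternation between textual thinking and visual sketching described in the abstract can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the control token `<sketch>`, the fixed number of latent sketch tokens per burst, and the callables `next_token` and `next_latent` are all hypothetical stand-ins for the model's text head and continuous-embedding output.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Vector = List[float]


@dataclass
class TextStep:
    token: str  # a discrete text token


@dataclass
class SketchStep:
    embedding: Vector  # a continuous latent sketch token


Step = Union[TextStep, SketchStep]

# Hypothetical control tokens; the source does not name the real ones.
SKETCH_START = "<sketch>"
EOS = "<eos>"


def hybrid_decode(
    next_token: Callable[[List[Step]], str],
    next_latent: Callable[[List[Step]], Vector],
    num_sketch_tokens: int,
    max_steps: int = 64,
) -> List[Step]:
    """Interleave discrete text tokens and continuous latents in one trace."""
    trace: List[Step] = []
    # Soft cap: the length is checked only between text tokens.
    while len(trace) < max_steps:
        token = next_token(trace)  # textual thinking mode
        trace.append(TextStep(token))
        if token == EOS:
            break
        if token == SKETCH_START:
            # Visual sketching mode: emit embeddings instead of token ids.
            # A fixed burst length is an assumption made for this example.
            for _ in range(num_sketch_tokens):
                trace.append(SketchStep(next_latent(trace)))
    return trace


# Toy stand-ins so the loop can be run without a model.
SCRIPT = ["The", "net", "folds", SKETCH_START, "so", "answer", "B", EOS]


def toy_next_token(trace: List[Step]) -> str:
    n_text = sum(isinstance(step, TextStep) for step in trace)
    return SCRIPT[n_text]


def toy_next_latent(trace: List[Step]) -> Vector:
    return [float(len(trace)), 0.0]


if __name__ == "__main__":
    for step in hybrid_decode(toy_next_token, toy_next_latent, num_sketch_tokens=3):
        print(step)
```

In this toy version each latent is appended to the same trace that conditions the next step, which mirrors the idea of feeding visual thoughts back into a single auto-regressive context without decoding them to pixels.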

Paper Structure

This paper contains 23 sections, 4 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: SkiLa empowers MLLMs to think in a unified manner through a hybrid auto-regressive process that generates multi-step, interleaved multi-modal reasoning traces, enabling the model to think flexibly with visual-text imagination and interactions that are not predefined. It dynamically alternates between the textual thinking mode, which generates textual thoughts, and the visual sketching mode, which generates latent sketch tokens as visual thoughts, effectively solving challenging tasks on which current linguistic-CoT-based methods fail.
  • Figure 2: The training and inference of SkiLa. It dynamically alternates between textual thinking mode to generate textual thoughts and visual sketching mode to generate latent sketch tokens as visual thoughts. During training, the latent visual semantics reconstruction mechanism leverages a sketch module (encoder and projector), used exclusively during training, to extract visual embeddings from sketch images as reconstruction targets, ensuring the latent sketch tokens are semantically grounded.
  • Figure 3: The format structure of the SkiLa training sample.
  • Figure 4: Results on MME-RealWorld-Lite. (Left) Impact of the number of reconstructed sketch visual tokens on model performance. (Middle and Right) Impact of the latent sketch reconstruction loss weight on model performance, with 9 and 27 sketch tokens respectively.
  • Figure 5: An example of spatial imagination. SkiLa generates visual thoughts to imagine the 3D object from a 2D pattern.
  • ...and 6 more figures
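
The Figure 2 and Figure 4 captions mention reconstruction targets extracted by a training-only sketch module and a weight on the latent sketch reconstruction loss. One way such an objective could be combined with ordinary next-token training is sketched below. The form of the loss is an assumption: the source does not state whether the reconstruction term is a mean squared error, a cosine distance, or something else, and the names `text_loss`, `reconstruction_loss`, `combined_loss`, and `recon_weight` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def text_loss(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the correct text tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def reconstruction_loss(pred: List[Vector], target: List[Vector]) -> float:
    """Mean squared error between predicted latents and target embeddings.

    MSE is an assumed choice for illustration; the source does not specify it.
    """
    total, count = 0.0, 0
    for pred_vec, target_vec in zip(pred, target):
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def combined_loss(
    target_probs: Sequence[float],
    pred_latents: List[Vector],
    target_embeddings: List[Vector],
    recon_weight: float,
) -> float:
    """Text loss plus a weighted reconstruction term on latent sketch tokens."""
    return text_loss(target_probs) + recon_weight * reconstruction_loss(
        pred_latents, target_embeddings
    )


if __name__ == "__main__":
    # Two text positions and two 2-dimensional latent sketch tokens.
    loss = combined_loss(
        target_probs=[0.9, 0.8],
        pred_latents=[[0.1, 0.2], [0.3, 0.4]],
        target_embeddings=[[0.0, 0.2], [0.3, 0.5]],
        recon_weight=0.5,
    )
    print(loss)
```

Here `target_embeddings` plays the role of the embeddings that the sketch module extracts from sketch images during training; at inference no such targets are needed, consistent with the caption's note that the module is used exclusively during training.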