Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong; Jiaqi Gu; Yujing Lou; Lubin Fan; Yixiong Zou; Yue Wu; Jieping Ye; Ruixuan Li

TL;DR

This work addresses a limitation of multimodal LLMs: they reason primarily in text and struggle with tasks that require visual imagination. It introduces Sketch-in-Latents (SkiLa), a hybrid auto-regressive framework that interleaves discrete textual think tokens with continuous latent sketch tokens in a single reasoning process, trained using a latent visual semantics reconstruction objective. SkiLa achieves state-of-the-art performance on vision-centric tasks and demonstrates strong generalization to diverse multi-modal benchmarks, while avoiding external tools or pixel-level image generation during inference. The approach highlights a path toward intrinsically unified visual-text reasoning in MLLMs and opens avenues for task-adaptive, multi-step reasoning patterns.

Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Existing approaches either rely on predefined external toolkits or generate images during thinking. Humans, by contrast, form flexible visual-text imagination and interactions during thinking without predefined toolkits, and one important reason is that they construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, so that, ideally, all visual imagination processes are encoded by latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between a textual thinking mode, which generates textual think tokens, and a visual sketching mode, which generates latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure that these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Code will be released at https://github.com/TungChintao/SkiLa.
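
The alternation between the textual thinking mode and the visual sketching mode described in the abstract can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the marker tokens `<sketch>` and `</sketch>`, the fixed number of latent tokens per sketch (`sketch_len`), and the callables `next_text` and `next_latent` are all hypothetical names introduced here for illustration. The source does not specify how mode switches are triggered or how many latent sketch tokens are generated.

```python
from typing import Callable, List, Union

# Hypothetical marker tokens; the source does not name the real special tokens.
SKETCH_START = "<sketch>"
SKETCH_END = "</sketch>"
EOS = "<eos>"

Latent = List[float]            # a continuous latent sketch token (toy stand-in)
Element = Union[str, Latent]    # one element of the interleaved sequence


def hybrid_decode(
    next_text: Callable[[List[Element]], str],
    next_latent: Callable[[List[Element]], Latent],
    prompt: List[Element],
    sketch_len: int = 4,
    max_steps: int = 64,
) -> List[Element]:
    """Interleave discrete text tokens and continuous latents in one sequence.

    Assumption: a sketch span has a fixed length `sketch_len` and is opened
    by the text head emitting SKETCH_START. The real switching rule may differ.
    """
    if sketch_len < 1:
        raise ValueError("sketch_len must be at least 1")

    seq: List[Element] = list(prompt)
    mode = "text"
    remaining = 0

    for _ in range(max_steps):
        if mode == "text":
            # Textual thinking mode: emit one discrete token.
            tok = next_text(seq)
            seq.append(tok)
            if tok == EOS:
                break
            if tok == SKETCH_START:
                mode, remaining = "sketch", sketch_len
        else:
            # Visual sketching mode: emit one continuous latent and feed it
            # back into the context without decoding it to pixels.
            seq.append(next_latent(seq))
            remaining -= 1
            if remaining == 0:
                seq.append(SKETCH_END)
                mode = "text"
    return seq
```

In this toy version both heads see the same interleaved context, which mirrors the idea of a unified reasoning sequence. In an actual MLLM, `next_text` would presumably correspond to sampling from the language-model head and `next_latent` to reading out a hidden state as the next input embedding, but that mapping is an assumption here.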
