Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

TL;DR

Monet tackles the challenge of visual reasoning in latent spaces by training multimodal LLMs to generate continuous latent embeddings as intermediate visual thoughts. It introduces Monet-SFT, a three-stage supervised fine-tuning pipeline, and VLPO, a Visual-latent Policy Optimization RL method that directly optimizes latent embeddings with task rewards. A high-quality Monet-SFT-125K dataset enables effective supervision of latent reasoning without excessive latent–image alignment costs. Empirical results show consistent gains on perception and reasoning benchmarks and strong out-of-distribution performance on abstract tasks, with ablations clarifying the importance of dual supervision and latent-only backpropagation. The work also discusses limitations such as training complexity and opens avenues for reward-design refinements in visual latent reasoning.
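
To make "continuous latent embeddings as intermediate visual thoughts" concrete, the following is a minimal sketch of a decoding loop, not the authors' implementation. It follows the Figure 1 caption in two respects: latent reasoning begins when the model emits a special start marker, and the number of latent embeddings is fixed. Everything else is an assumption made for illustration: the marker strings, the `step_fn` interface, and the choice to feed the model's last hidden state back as the next input in latent mode.

```python
from typing import Callable, List, Tuple, Union

Vector = List[float]
# A history entry is ("text", token) or ("latent", embedding).
Step = Tuple[str, Union[str, Vector]]

LATENT_START = "<latent_start>"  # hypothetical marker name
EOS = "<eos>"                    # hypothetical end-of-sequence token


def decode_with_latents(
    step_fn: Callable[[List[Step]], Tuple[str, Vector]],
    num_latents: int,
    max_steps: int = 64,
) -> List[Step]:
    """Interleave text tokens with fixed-length runs of latent embeddings.

    step_fn(history) is a stand-in for one forward pass; it returns the
    predicted next token and the last hidden state (both hypothetical).
    """
    history: List[Step] = []
    latents_left = 0
    for _ in range(max_steps):
        token, hidden = step_fn(history)
        if latents_left > 0:
            # Latent mode: keep the continuous vector, ignore the token.
            history.append(("latent", hidden))
            latents_left -= 1
            continue
        history.append(("text", token))
        if token == LATENT_START:
            # The model chose to start latent reasoning; length is fixed.
            latents_left = num_latents
        elif token == EOS:
            break
    return history


def toy_step_fn(history: List[Step]) -> Tuple[str, Vector]:
    """Scripted toy model used only to exercise the loop."""
    script = ["Look", LATENT_START, "answer", EOS]
    n_text = sum(1 for kind, _ in history if kind == "text")
    token = script[min(n_text, len(script) - 1)]
    hidden = [float(len(history)), 0.0]
    return token, hidden


if __name__ == "__main__":
    for kind, value in decode_with_latents(toy_step_fn, num_latents=3):
        print(kind, value)
```

With `num_latents=3`, the toy run produces two text tokens, three latent embeddings, and then two more text tokens ending in the end-of-sequence marker. In a real model, `step_fn` would wrap a forward pass and the latent vectors would be consumed as input embeddings rather than decoded to text.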

Abstract

"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

Paper Structure

This paper contains 27 sections, 19 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Method overview. Left: During inference, Monet can automatically decide when to start latent reasoning by outputting a special start embedding. We fix the output length of the latent embeddings. Right: We propose a three-stage SFT (Section \ref{sec:SFT}) and RL (Section \ref{sec:RL}) framework. The SFT stages progressively warm up the model, generate high-quality latent embeddings, and distill latent reasoning ability. The RL stage further refines the model using our VLPO algorithm, specifically designed for latent reasoning.
  • Figure 2: Construction pipeline of Monet-SFT-125K. Stage 1 filters hard samples (unsolvable from the original image). Stage 2 keeps those where auxiliary images lead to correct answers, ensuring their necessity and correctness. Stage 3 highlights key visual-observation tokens using advanced LLM judges, providing strong supervision for learning latent embeddings.
  • Figure 3: The proposed three-stage SFT pipeline: warm-up, supervised latent–observation alignment with controlled attention flow, and latent generation without auxiliary-image access.
  • Figure 4: Prediction accuracy of the observation tokens during warm-up. Training on image–text interleaved data encourages the model to utilize intermediate visual cues.
  • Figure 5: Effect of the number of abstract visual embeddings used during training and inference on test accuracy. The dashed line marks the accuracy of Qwen2.5-VL-7B.
  • ...and 9 more figures
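
The three filtering stages in the Figure 2 caption can be read as a simple pipeline. The sketch below is one possible reading and should not be taken as the authors' code. The `Sample` fields and the three callables (`answer_without_aux`, `answer_with_aux`, `judge_is_observation`) are hypothetical stand-ins for a base model queried on the original image only, the same model given the auxiliary images, and an LLM judge that flags visual-observation tokens.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


@dataclass
class Sample:
    question: str
    answer: str
    cot_tokens: List[str]
    # Filled in by stage 3: True where a token states a visual observation.
    observation_mask: List[bool] = field(default_factory=list)


def build_dataset(
    samples: List[Sample],
    answer_without_aux: Callable[[Sample], str],  # hypothetical model call
    answer_with_aux: Callable[[Sample], str],     # hypothetical model call
    judge_is_observation: Callable[[Sample, int], bool],  # hypothetical judge
) -> List[Sample]:
    kept: List[Sample] = []
    for s in samples:
        # Stage 1: keep only hard samples, i.e. those the model cannot
        # solve from the original image alone.
        if answer_without_aux(s) == s.answer:
            continue
        # Stage 2: keep only samples where the auxiliary images lead to
        # the correct answer, so they are both necessary and correct.
        if answer_with_aux(s) != s.answer:
            continue
        # Stage 3: mark the key visual-observation tokens in the CoT.
        mask = [judge_is_observation(s, i) for i in range(len(s.cot_tokens))]
        kept.append(replace(s, observation_mask=mask))
    return kept


# Toy stand-ins, keyed by question, used only to exercise the pipeline.
_NO_AUX = {"q1": "4", "q2": "3", "q3": "1"}
_WITH_AUX = {"q1": "4", "q2": "7", "q3": "2"}

toy_samples = [
    Sample("q1", "4", ["two", "plus", "two"]),          # easy: dropped
    Sample("q2", "7", ["the", "chart", "shows", "7"]),  # kept
    Sample("q3", "9", ["unclear", "region"]),           # aux fails: dropped
]

toy_kept = build_dataset(
    toy_samples,
    answer_without_aux=lambda s: _NO_AUX[s.question],
    answer_with_aux=lambda s: _WITH_AUX[s.question],
    judge_is_observation=lambda s, i: s.cot_tokens[i] in {"chart", "7"},
)
```

In this toy run only `q2` survives: `q1` is solvable without auxiliary images, and `q3` stays wrong even with them. The resulting mask is the kind of token-level signal the caption describes as supervision for learning latent embeddings.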