OneLatent: Single-Token Compression for Visual Latent Reasoning

Bo Lv; Yasheng Sun; Junjie Wang; Haoxiang Shi

TL;DR

This work tackles the high computational and memory costs of explicit chain-of-thought (CoT) reasoning in large language models by proposing OneLatent, a framework that compresses intermediate reasoning into a single visual latent token. A three-stage curriculum (CoT Cold Start, OneLatent Alignment, Focus Fine-tuning) uses rendered CoT images and DeepSeek-OCR–derived hidden states as supervision to align the single latent token with internal reasoning, while the online interface remains text-only at inference. Empirically, OneLatent achieves up to $87.4\times$ compression on long-chain tasks. Across benchmarks, it loses only $2.21\%$ macro-averaged accuracy relative to textual CoT and improves Output Token Contribution (OTC) by $6.8\times$, which supports compression-constrained generalization. Average output length falls from about $74.6$ tokens to roughly $6.8$ tokens, which lowers KV-cache growth and offers speedups for deployment under strict token budgets, while performance on ProntoQA and ProsQA remains strong with a single latent token.

Abstract

Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address this cost, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by $11\times$ with only a $2.21\%$ average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by $6.8\times$. On long-chain logical reasoning, OneLatent reaches $99.80\%$ on ProntoQA and $97.80\%$ on ProsQA with one latent token, with compression up to $87.4\times$, supporting compression-constrained generalization.
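As a quick sanity check, the $11\times$ length reduction quoted in the abstract can be reproduced from the average output lengths given in the TL;DR (about $74.6$ and $6.8$ tokens). The sketch below is only illustrative arithmetic; the helper name `compression_ratio` is hypothetical and not taken from the paper.

```python
def compression_ratio(baseline_tokens: float, compressed_tokens: float) -> float:
    """Ratio of baseline output length to compressed output length.

    Hypothetical helper for illustration; not an API from the paper.
    """
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return baseline_tokens / compressed_tokens


# Average output lengths quoted in the TL;DR (textual CoT vs. OneLatent).
ratio = compression_ratio(74.6, 6.8)
print(f"{ratio:.1f}x")  # prints "11.0x"
```

The up-to-$87.4\times$ figure is a per-task maximum on long-chain reasoning, so it is not expected to match this average.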

Paper Structure

This paper contains 43 sections, 12 equations, 3 figures, 3 tables, 2 algorithms.

Introduction
Related Work
Latent reasoning methods.
Text-based reasoning methods.
Visual context compression.
OneLatent Method
Overview
Latent Segment and Continuous Filling
Data Preparation and CoT Rendering
Quality verification via DeepSeek-OCR.
Rendering algorithm.
Hidden-State Targets via DeepSeek-OCR Encoder
Text-only training and inference.
Patch alignment and OCR-friendly layout.
Three-Stage Training Strategy
...and 28 more sections

Figures (3)

Figure 1: Reasoning interfaces and efficiency plot. (a) We illustrate five reasoning interfaces and their output forms: No CoT, textual CoT, iCoT, COCONUT, and OneLatent. (b) On GSM8K, we plot OTC (%/token) versus the average number of generated output tokens for the same methods. OneLatent appears in the lowest-output-length region and keeps high OTC, indicating substantially shorter outputs than the compared reasoning interfaces.
Figure 2: Three-stage training pipeline of OneLatent. Stage 1 trains explicit CoT generation with cross-entropy. Stage 2 replaces CoT with one latent token and adds MSE alignment to pre-extracted visual hidden-state targets. Stage 3 keeps the one-latent interface and trains answer generation without alignment loss. The figure shows the supervision shift from explicit textual CoT to latent alignment and then to answer-only decoding.
Figure 3: Data preparation pipeline for latent supervision. (a) CoT text is rendered into a fixed-size image with controlled layout parameters. Middle: The rendered image is encoded by frozen visual modules and the LLM backbone. (b) A hidden-state target is extracted and stored for Stage 2 supervision. The figure depicts one offline target vector produced from each CoT sample.
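The supervision shift described in the Figure 2 caption can be sketched as a per-stage loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the way the two terms are summed, and the `align_weight` parameter are all hypothetical, and the per-token cross-entropy values are taken as given rather than computed from a model.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def stage_loss(stage, ce_terms, latent_state=None, visual_target=None,
               align_weight=1.0):
    """Training loss for one example in a given stage (illustrative only).

    ce_terms      -- per-token cross-entropy values for the supervised text
                     (CoT plus answer in Stage 1, answer tokens afterwards)
    latent_state  -- hidden state at the single latent token (Stage 2)
    visual_target -- pre-extracted visual hidden-state target (Stage 2)
    align_weight  -- assumed weighting of the alignment term (hypothetical)
    """
    ce = sum(ce_terms) / len(ce_terms)
    if stage == 2:
        # Stage 2 adds MSE alignment to the offline visual target.
        return ce + align_weight * mse(latent_state, visual_target)
    if stage in (1, 3):
        # Stages 1 and 3 use cross-entropy only (no alignment loss).
        return ce
    raise ValueError("stage must be 1, 2, or 3")


# Toy values: one answer token with CE 1.0, a 2-dimensional latent state.
loss = stage_loss(2, [1.0], latent_state=[1.0, 2.0],
                  visual_target=[0.0, 0.0], align_weight=0.5)
print(loss)  # prints 2.25
```

In this sketch the visual target is treated as a fixed vector, consistent with the caption's description of targets being pre-extracted offline.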
