OneLatent: Single-Token Compression for Visual Latent Reasoning

Bo Lv, Yasheng Sun, Junjie Wang, Haoxiang Shi

TL;DR

This work tackles the high computational and memory costs of explicit chain-of-thought reasoning in large language models by proposing OneLatent, a framework that compresses intermediate reasoning into a single visual latent token. A three-stage curriculum (CoT Cold Start, OneLatent Alignment, Focus Fine-tuning) uses rendered CoT images and DeepSeek-OCR–derived hidden states as supervision to align a lone latent token with internal reasoning, while the online interface remains text-only at inference. Empirically, OneLatent achieves up to $87.4\times$ compression on long-chain tasks with only $2.21\%$ macro-averaged accuracy loss and a $6.8\times$ improvement in Output Token Contribution (OTC), demonstrating promising compression-constrained generalization. The approach reduces output length dramatically (from about $74.6$ tokens to roughly $6.8$ tokens on average) and lowers KV-cache growth, offering substantial speedups for deployment under strict token budgets, while maintaining strong performance on ProntoQA and ProsQA with a single latent token.
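
The average-length figures quoted above can be cross-checked with a line of arithmetic. The sketch below is only illustrative: the helper name `compression_ratio` is hypothetical, and the two inputs are the rounded averages from this summary, not per-benchmark measurements.

```python
def compression_ratio(baseline_tokens: float, compressed_tokens: float) -> float:
    """Return how many times shorter the compressed output is than the baseline."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return baseline_tokens / compressed_tokens


# Rounded average output lengths quoted in the TL;DR (textual CoT vs. OneLatent).
ratio = compression_ratio(74.6, 6.8)
print(f"average output-length reduction: {ratio:.2f}x")
```

With these rounded inputs the ratio comes out just under 11, in line with the average reduction stated in the abstract. The $87.4\times$ figure is a separate maximum on long-chain tasks and cannot be derived from these averages.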

Abstract

Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address this cost, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by $11\times$ with only a $2.21\%$ average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by $6.8\times$. On long-chain logical reasoning, OneLatent reaches $99.80\%$ on ProntoQA and $97.80\%$ on ProsQA with one latent token, with compression up to $87.4\times$, supporting compression-constrained generalization.

Paper Structure

This paper contains 43 sections, 12 equations, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: Reasoning interfaces and efficiency plot. (a) We illustrate five reasoning interfaces and their output forms: No CoT, textual CoT, iCoT, COCONUT, and OneLatent. (b) On GSM8K, we plot OTC (%/token) against the average number of generated output tokens for the same methods. OneLatent sits in the lowest-output-length region while keeping a high OTC, indicating substantially shorter outputs than the other reasoning interfaces.
  • Figure 2: Three-stage training pipeline of OneLatent. Stage 1 trains explicit CoT generation with cross-entropy. Stage 2 replaces CoT with one latent token and adds MSE alignment to pre-extracted visual hidden-state targets. Stage 3 keeps the one-latent interface and trains answer generation without alignment loss. The figure shows the supervision shift from explicit textual CoT to latent alignment and then to answer-only decoding.
  • Figure 3: Data preparation pipeline for latent supervision. (a) CoT text is rendered into a fixed-size image with controlled layout parameters. Middle: the rendered image is encoded by frozen visual modules and the LLM backbone. (b) A hidden-state target is extracted and stored for Stage 2 supervision. The figure depicts one offline target vector produced from each CoT sample.
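
The objectives described in the Figure 2 caption can be sketched as a per-stage loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `align_weight` parameter, and the plain-sum combination of the two terms are hypothetical, and cross-entropy is reduced to a single target token for brevity (Stage 1 would apply it over the CoT tokens, Stages 2 and 3 over the answer tokens).

```python
import math
from typing import Optional, Sequence


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-likelihood of one target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def stage_loss(
    stage: int,
    logits: Sequence[float],
    target_index: int,
    latent_hidden: Optional[Sequence[float]] = None,
    visual_target: Optional[Sequence[float]] = None,
    align_weight: float = 1.0,  # hypothetical weighting; not given in the source
) -> float:
    """Loss for one training example in a given curriculum stage."""
    ce = cross_entropy(logits, target_index)
    if stage in (1, 3):
        # Stage 1: cross-entropy on explicit CoT; Stage 3: answer-only, no alignment.
        return ce
    if stage == 2:
        if latent_hidden is None or visual_target is None:
            raise ValueError("stage 2 needs latent_hidden and visual_target")
        # Stage 2: add MSE between the latent token's hidden state and the
        # pre-extracted visual hidden-state target.
        return ce + align_weight * mse(latent_hidden, visual_target)
    raise ValueError("stage must be 1, 2, or 3")


# Toy example: two-way logits and a 2-dimensional hidden state.
print(stage_loss(2, [0.0, 0.0], 0, latent_hidden=[1.0, 0.0], visual_target=[0.0, 0.0]))
print(stage_loss(3, [0.0, 0.0], 0))
```

In this sketch the only difference between Stages 2 and 3 is the alignment term, mirroring the caption's description of a shift from latent alignment to answer-only decoding. The visual targets are treated as fixed inputs because the Figure 3 caption describes them as extracted offline.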