OneLatent: Single-Token Compression for Visual Latent Reasoning
Bo Lv, Yasheng Sun, Junjie Wang, Haoxiang Shi
TL;DR
This work tackles the high computational and memory cost of explicit chain-of-thought (CoT) reasoning in large language models by proposing OneLatent, a framework that compresses intermediate reasoning into a single visual latent token. A three-stage curriculum (CoT Cold Start, OneLatent Alignment, Focus Fine-tuning) uses rendered CoT images and hidden states derived from DeepSeek-OCR as supervision to align the single latent token with the model's internal reasoning, while the inference-time interface remains text-only. Empirically, OneLatent reduces average output length from about $74.6$ tokens to roughly $6.8$ tokens (about $11\times$) with only a $2.21\%$ macro-averaged accuracy loss relative to textual CoT and a $6.8\times$ improvement in Output Token Contribution (OTC). On long-chain tasks, compression reaches up to $87.4\times$, and the model maintains strong performance on ProntoQA and ProsQA with a single latent token, supporting compression-constrained generalization. The shorter outputs also lower KV-cache growth, offering substantial speedups for deployment under strict token budgets.
Abstract
Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address this cost, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by $11\times$ with only a $2.21\%$ average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by $6.8\times$. On long-chain logical reasoning, OneLatent reaches $99.80\%$ on ProntoQA and $97.80\%$ on ProsQA with one latent token, with compression up to $87.4\times$, supporting compression-constrained generalization.
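As a quick consistency check, the $11\times$ average reduction can be recovered from the average output lengths quoted in the TL;DR (about $74.6$ tokens for textual CoT versus roughly $6.8$ tokens for OneLatent). The sketch below assumes that compression is simply the ratio of baseline to compressed output length. The helper name `compression_ratio` is illustrative and is not taken from the paper.

```python
def compression_ratio(baseline_tokens: float, compressed_tokens: float) -> float:
    """Ratio of baseline output length to compressed output length.

    Assumption: compression is defined as a plain length ratio.
    """
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return baseline_tokens / compressed_tokens


# Average output lengths as quoted in the TL;DR.
avg_ratio = compression_ratio(74.6, 6.8)
print(f"average compression: {avg_ratio:.2f}x")  # average compression: 10.97x
```

The result rounds to the reported $11\times$. The $87.4\times$ figure applies to long-chain tasks specifically and cannot be derived from these two averages.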
