ORION: Teaching Language Models to Reason Efficiently in the Language of Thought

Kumar Tanmay, Kriti Aggarwal, Paul Pu Liang, Subhabrata Mukherjee

TL;DR

ORION tackles verbose reasoning in large reasoning models by introducing Mentalese, a compact symbolic reasoning language, paired with SLPO to adaptively balance brevity and correctness. A two-stage training pipeline—Stage 1 supervised finetuning on Mentalese traces and Stage 2 verifier-based RL with GRPO or SLPO (RLVR)—yields 10×–20× reductions in reasoning tokens while preserving accuracy. Empirical results across diverse math benchmarks show 4–16× token compression, up to 5× faster inference, and 7–9× lower training costs, with competitive or superior accuracy compared to strong baselines. The findings indicate that structured, symbolic reasoning can achieve human-like cognitive efficiency in real-time settings, supporting practical deployment of reasoning-capable AI agents.

Abstract

Large Reasoning Models (LRMs) achieve strong performance in mathematics, code generation, and task planning, but their reliance on long chains of verbose "thinking" tokens leads to high latency, redundancy, and incoherent reasoning paths. Inspired by the Language of Thought Hypothesis, which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese, we introduce a framework that trains models to reason in a similarly compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. To improve both efficiency and accuracy, we propose SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method that rewards concise solutions that stay correct, while still allowing longer reasoning when needed. Applied to Mentalese-aligned models, SLPO yields significantly higher compression rates by enabling concise reasoning that preserves the benefits of detailed thinking without the computational overhead. Across benchmarks including AIME 2024 and 2025, MinervaMath, OlympiadBench, Math500, and AMC, our ORION models produce reasoning traces with 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training costs by 7-9x relative to the DeepSeek R1 Distilled model, while maintaining 90-98% of its accuracy. ORION also surpasses Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression. These results show that Mentalese-style compressed reasoning offers a step toward human-like cognitive efficiency, enabling real-time, cost-effective reasoning without sacrificing accuracy.
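
The abstract describes SLPO only in words: it rewards concise solutions that stay correct while still allowing longer reasoning when needed. The abstract does not give the actual objective, so the following is a minimal illustrative sketch of how a reward with that behavior could look, not the paper's formula. The function name `length_aware_rewards`, the bonus weight `alpha`, and the linear bonus are all assumptions introduced here for illustration.

```python
from typing import List


def length_aware_rewards(
    correct: List[bool], lengths: List[int], alpha: float = 0.5
) -> List[float]:
    """Hypothetical length-aware reward for one group of sampled responses.

    Assumed behavior (not taken from the paper):
      - an incorrect response gets 0;
      - a correct response gets a base reward of 1;
      - a correct response also gets a bonus of up to `alpha`, scaled by how
        much shorter it is than the longest correct response in the group.
    """
    correct_lengths = [n for ok, n in zip(correct, lengths) if ok]
    if not correct_lengths:
        # Nothing was solved, so there is no length signal to give.
        return [0.0] * len(correct)

    shortest, longest = min(correct_lengths), max(correct_lengths)
    span = longest - shortest

    rewards = []
    for ok, n in zip(correct, lengths):
        if not ok:
            rewards.append(0.0)
        elif span == 0:
            # All correct responses have the same length: no length pressure.
            rewards.append(1.0)
        else:
            rewards.append(1.0 + alpha * (longest - n) / span)
    return rewards


if __name__ == "__main__":
    # Four sampled responses to one problem: three correct, one wrong.
    print(length_aware_rewards([True, True, False, True], [100, 300, 50, 200]))
```

In this sketch a long response that is the only correct one in its group still receives the full base reward, which is one simple way to avoid punishing longer reasoning on problems that need it. Brevity is rewarded only relative to other correct responses.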

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Performance–efficiency trade-offs of various model families across six mathematical reasoning benchmarks (including AIME2025). The dotted curve indicates the Pareto frontier, which illustrates the trade-off between higher compression rates and loss in accuracy. Our proposed method, combining Mentalese alignment with SLPO, consistently lies on this frontier, identifying an optimal operating point that achieves a balance between accuracy and efficiency.
  • Figure 2: Contrast between human and machine reasoning (response from DeepSeek-R1). While humans arrive at intuitive and concise solutions, LLMs often produce verbose and redundant reasoning chains even for simple problems. We bridge this gap by developing methods that encourage models to reason more like humans—clear, efficient, and direct—while preserving accuracy. Grounded in the Language of Thought hypothesis, human reasoning compresses complex ideas into minimal symbolic steps, reflecting cognitive efficiency. Emulating this compact reasoning reduces redundancy in machine outputs, improving both interpretability and token efficiency.
  • Figure 3: Illustration of symbolic, logic-based chain of thought (Mentalese). This figure shows the definition (top), an example of symbolic reasoning steps (left), and the rules governing the reasoning style (right).
  • Figure 4: Comparison of reasoning traces on AIME 2024. The Agentica-24k model uses approximately 7800 tokens, ORION-AG uses 150 tokens, and ORION-AG-SLPO uses 300 tokens, while achieving similar accuracy.
  • Figure 5: This figure compares direct SLPO on the base model with intermediate SFT followed by RLHF methods (SLPO/GRPO) on the MentaleseR-40k dataset across five metrics. The Mentalese alignment yields greater training stability and efficiency: (1) Response Length shows that direct SLPO collapses due to gradient instability, while ORION models stay stable; (2) Clip Ratio indicates more controlled updates in Mentalese methods, driven by reduced response truncation; (3) Entropy Loss reflects better exploration–exploitation balance; (4) Training Time per RL Step shows higher computational efficiency; (5) Test Performance on AIME 2024 ($\sim$22% Pass@1) confirms ORION models outperform direct SLPO on the base model. Shaded regions denote min–max ranges across runs. These results highlight the importance of structured intermediate representations (Mentalese) for stable, efficient RL in large language models.
  • ...and 1 more figures