Latent Poincaré Shaping for Agentic Reinforcement Learning

Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu

TL;DR

LaPha is an AlphaZero-like framework that moves search, learning, and pruning into a root-centered Poincaré latent space for agentic reasoning in LLMs. By defining geodesic distances and a potential-based shaping function in hyperbolic space, it converts sparse verification signals into dense intermediate rewards and trains a lightweight value head to guide test-time MCTS. The approach improves accuracy across math benchmarks and model scales, with further gains from self-guided search and from latent-space pruning that preserves diversity and keeps inference scalable. The work illustrates the practical value of negative-curvature latent geometry for structured reasoning and frames post-training as latent preference shaping.

Abstract

We propose LaPha, a method for training AlphaZero-like LLM agents in a Poincaré latent space. Under LaPha, the search process can be visualized as a tree rooted at the prompt and growing outward from the origin toward the boundary of the Poincaré ball, where negative curvature provides exponentially increasing capacity with radius. Using hyperbolic geodesic distance to rule-verified correctness, we define a node potential and assign dense process rewards by potential differences. We further attach a lightweight value head on the same shared latent space, enabling self-guided test-time scaling with almost no additional overhead. On MATH-500, LaPha improves Qwen2.5-Math-1.5B from 66.0% to 88.2%. With value-head-guided search, LaPha-1.5B reaches 56.7% accuracy on AIME'24, and LaPha-7B further achieves 60.0% on AIME'24 and 53.3% on AIME'25.
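
The reward construction described in the abstract can be illustrated with a minimal sketch. The geodesic distance below is the standard formula for the unit Poincaré ball with curvature -1. Everything else is an assumption made for illustration, not the paper's definition: the names `potential` and `shaped_reward`, the choice of potential as the negative distance to the nearest verified-correct leaf, and the `gamma` parameter.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré ball (curvature -1)."""
    sq = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - sq(x)) * (1.0 - sq(y))
    return math.acosh(1.0 + 2.0 * sq(diff) / denom)


def potential(z, correct_leaves):
    """Hypothetical node potential: negative distance to the nearest
    latent of a rule-verified correct leaf (an assumption of this sketch)."""
    return -min(poincare_distance(z, c) for c in correct_leaves)


def shaped_reward(parent, child, correct_leaves, gamma=1.0):
    """Dense process reward as a potential difference along one tree edge."""
    return gamma * potential(child, correct_leaves) - potential(parent, correct_leaves)


if __name__ == "__main__":
    root = [0.0, 0.0]             # the prompt sits at the origin
    correct = [[0.6, 0.0]]        # toy latent of one verified-correct leaf
    # Distance from the origin to radius 0.5 is ln(3), roughly 1.0986.
    print(poincare_distance(root, [0.5, 0.0]))
    # A step toward the correct leaf earns a positive reward.
    print(shaped_reward(root, [0.3, 0.0], correct))
```

With `gamma=1.0` the rewards along any root-to-leaf path telescope to the potential of the leaf minus the potential of the root, so the dense signal redistributes the sparse verification outcome along the path rather than adding new reward.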

Paper Structure

This paper contains 30 sections, 24 equations, 6 figures, 2 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Overview of LaPha. (a) In the Poincaré disk model of hyperbolic geometry, straight lines are geodesics—Euclidean circular arcs orthogonal to the unit circle (or diameters); the figure illustrates a family of hyperbolic parallel lines that share an ideal boundary point. (b) A hyperbolic tiling in the Poincaré disk, visualizing the rapid expansion of effective capacity with radius in negative curvature. (c) LaPha mean-pools the backbone hidden states to obtain a node representation, maps it into a prompt-centered Poincaré latent state, and reuses this shared latent space for geodesic potential shaping, value estimation, and latent-space pruning in search.
  • Figure 2: Left-top: Geometry ablation on the same rollout tree: node potentials $V(i)$ computed from the shared root-centered latents using Poincaré geodesic distance (disk view) versus Euclidean distance (ambient-space view). Left-bottom: During training, the value head’s top-1 selection (Pass@1) increasingly outperforms the average correctness among terminal leaves (Average Acc.). Middle: A guided rollout visualized in the prompt-centered Poincaré disk, colored by the geodesic potential $V(i)$. Right: The same rollout colored by the value head prediction $f_\theta(\bar{\mathbf{h}}_i)$. The red box marks the root (prompt) state.
  • Figure 3: Test-time scaling with value-guided MCTS on AIME'24 and MATH-500. Increasing the number of simulations consistently improves accuracy, with large gains at small budgets and diminishing returns at larger budgets. On AIME'24, accuracy rises from about $30\%$ (1 simulation) to about $56\%$ (128 simulations); on MATH-500 it increases from about $75\%$ to about $88\%$. Horizontal reference lines are included for comparison.
  • Figure 4: Effect of latent-space pruning. Left: training loss. Right: AIME'24 validation score.
  • Figure 5: Rollout Trees
  • ...and 1 more figure
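
The pipeline in the Figure 1(c) caption (mean-pool the hidden states, then map into a prompt-centered Poincaré latent) might be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the caption does not say how the mapping is parameterized, so the sketch uses the exponential map at the origin with no learned projection, and it centers on the prompt by a Möbius translation that sends the root latent to the origin. The function names and the clipping constant `eps` are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average a list of per-token hidden vectors into one node vector."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]


def exp_map_origin(v, eps=1e-6):
    """Exponential map at the origin of the unit Poincaré ball (curvature -1).
    The radius is clipped just below 1 so large inputs stay inside the ball."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return [0.0] * len(v)
    radius = min(math.tanh(norm), 1.0 - eps)
    return [radius * c / norm for c in v]


def mobius_add(x, y):
    """Möbius addition x (+) y in the unit Poincaré ball."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [((1.0 + 2.0 * xy + y2) * a + (1.0 - x2) * b) / denom
            for a, b in zip(x, y)]


def root_centered_latent(node_hidden, root_hidden):
    """Assumed centering: translate by the negated root so the prompt maps to the origin."""
    root = exp_map_origin(mean_pool(root_hidden))
    node = exp_map_origin(mean_pool(node_hidden))
    return mobius_add([-c for c in root], node)


if __name__ == "__main__":
    root_h = [[0.2, 0.1], [0.4, -0.1]]   # toy 2-token, 2-dim hidden states
    node_h = [[0.5, 0.5], [0.1, 0.3]]
    print(root_centered_latent(root_h, root_h))  # the root lands at (numerically) the origin
    print(root_centered_latent(node_h, root_h))
```

Left Möbius translation is an isometry of the Poincaré ball, so centering of this kind leaves geodesic distances between nodes unchanged while placing the prompt at the origin.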