Test-time Diverse Reasoning by Riemannian Activation Steering

Ly Tran Ho Khanh, Dongxuan Zhu, Man-Chung Yue, Viet Anh Nguyen

TL;DR

The paper tackles the problem of limited output diversity in Best-of-$N$ reasoning for language models by introducing SPREAD, a test-time activation steering method. It casts steering as a Riemannian optimization on the product of spheres to maximize the volume spanned by intervened hidden activations, leveraging a log-determinant objective and a block-coordinate descent algorithm with exponential maps. The authors prove convergence properties and provide practical initialization and hyperparameter strategies, demonstrating strong gains in diversity and solution accuracy on mathematical benchmarks (e.g., AIME24, MATH500, OlympiadBench) with scalable inference-time costs. Overall, SPREAD offers a lightweight, parameter-efficient approach to enhance reasoning diversity without fine-tuning, with potential implications for robust multi-path problem solving in LMs.

Abstract

Best-of-$N$ reasoning improves the accuracy of language models in solving complex tasks by sampling multiple candidate solutions and then selecting the best one based on some criteria. A critical bottleneck for this strategy is the output diversity limit, which occurs when the model generates similar outputs despite stochastic sampling, and hence recites the same error. To address this lack of variance in reasoning paths, we propose a novel unsupervised activation steering strategy that simultaneously optimizes the steering vectors for multiple reasoning trajectories at test time. At any synchronization anchor along the batch generation process, we find the steering vectors that maximize the total volume spanned by all possible intervened activation subsets. We demonstrate that these steering vectors can be determined by solving a Riemannian optimization problem over the product of spheres with a log-determinant objective function. We then use a Riemannian block-coordinate descent algorithm with a well-tuned learning rate to obtain a stationary point of the problem, and we apply these steering vectors until the generation process reaches the subsequent synchronization anchor. Empirical evaluations on popular mathematical benchmarks demonstrate that our test-time Riemannian activation steering strategy outperforms vanilla sampling techniques in terms of generative diversity and solution accuracy.
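The optimization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the objective is $\log\det(I + A^\top A)$, where the columns of $A$ are the intervened activations $h_i + v_i$ (by a standard identity, $\det(I + A^\top A)$ is the sum of squared volumes over all column subsets), and that each steering vector lies on a sphere of a fixed radius. The function names, `radius`, `step`, `sweeps`, and the step-halving safeguard are all hypothetical choices made for this sketch.

```python
import numpy as np

def objective(H, V):
    """Assumed objective: log det(I + A^T A), with A = H + V (one column per path)."""
    A = H + V
    n = A.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(n) + A.T @ A)
    return logdet

def block_gradient(H, V, i):
    """Euclidean gradient of the objective with respect to column i of V."""
    A = H + V
    n = A.shape[1]
    M = np.eye(n) + A.T @ A
    # The gradient with respect to A is 2 A M^{-1}; take its i-th column.
    return 2.0 * A @ np.linalg.solve(M, np.eye(n)[:, i])

def sphere_exp(v, xi, radius):
    """Exponential map on the sphere of the given radius, at v, along tangent xi."""
    norm_xi = np.linalg.norm(xi)
    if norm_xi < 1e-12:
        return v
    theta = norm_xi / radius
    return np.cos(theta) * v + radius * np.sin(theta) * xi / norm_xi

def rbcd_ascent_sketch(H, radius=1.0, step=0.5, sweeps=30, seed=0):
    """Block-coordinate ascent over a product of spheres (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    p, n = H.shape
    V = rng.standard_normal((p, n))
    V *= radius / np.linalg.norm(V, axis=0)  # put every column on the sphere
    values = [objective(H, V)]
    for _ in range(sweeps):
        for i in range(n):
            v = V[:, i]
            g = block_gradient(H, V, i)
            xi = g - (v @ g) / radius**2 * v  # project onto the tangent space at v
            eta, current = step, objective(H, V)
            for _ in range(20):  # halve the step until the objective improves
                trial = V.copy()
                trial[:, i] = sphere_exp(v, eta * xi, radius)
                if objective(H, trial) > current:
                    V = trial
                    break
                eta *= 0.5
        values.append(objective(H, V))
    return V, values
```

Because the sketch maximizes the objective, each block update is an ascent step, which is equivalent to descent on the negated objective. The exponential map keeps every steering vector exactly on its sphere, so no re-normalization is needed after an update.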

Paper Structure

This paper contains 19 sections, 8 theorems, 80 equations, 6 figures, 3 tables.

Key Result

Proposition 2

The volume-maximization problem is equivalent to a log-determinant optimization problem.

Figures (6)

  • Figure 1: Overview of SPREAD for generating $N$ diverse reasoning answers simultaneously. At each decoding step $\tau_t$, we extract the hidden vectors corresponding to the last token in each path. These hidden vectors serve as inputs to Algorithm 1, where they are projected into a shared activation space to compute $N$ steering vectors. This process is repeated until an end-of-sequence (EOS) token is generated.
  • Figure 2: An illustration of the volume maximization intuition behind SPREAD. The hidden vectors $h_1$ and $h_2$ (originally blue squares) are pushed toward target positions using the corresponding steering vectors $v_1$, $v_2$, found via Riemannian Block Coordinate Descent (Algorithm 1). After intervention, the new parallelepiped (red) has a larger volume than the original parallelepiped (dashed blue).
  • Figure 3: Comparison of SPREAD and sampling methods on AIME24 dataset using Qwen2.5-1.5B model, showing Pass@$N$ and Unique Solution Count.
  • Figure 4: Comparison of SPREAD and sampling methods on AIME24 dataset using Qwen2.5-Math-1.5B-Instruct model, showing Pass@$N$ and Unique Solution Count.
  • Figure 5: Average running time of Algorithm 1 with varying problem dimension $p$ and number of steering vectors $N$. Lower execution time is better.
  • ...and 1 more figure
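The volume intuition in the Figure 2 caption can be checked numerically. The snippet below is a small sketch, assuming the volume of the parallelepiped spanned by the columns of a matrix is computed as the square root of its Gram determinant. The vectors and the helper name `parallelepiped_volume` are hand-picked for illustration and do not come from the paper.

```python
import numpy as np

def parallelepiped_volume(vectors):
    """Volume spanned by the columns of `vectors`: sqrt(det(G)), G the Gram matrix."""
    gram = vectors.T @ vectors
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

# Columns are h_1 and h_2: two nearly parallel hidden vectors (illustrative values).
H = np.array([[1.0, 0.9],
              [0.0, 0.1]])

# Columns are v_1 and v_2: hand-picked unit-norm steering vectors (illustrative values).
V = np.array([[0.0, 0.0],
              [-1.0, 1.0]])

before = parallelepiped_volume(H)      # area spanned by the original vectors
after = parallelepiped_volume(H + V)   # area spanned after the intervention
print(f"volume before: {before:.3f}, after: {after:.3f}")
```

Steering the two vectors in opposite directions pulls them apart, so the spanned area grows. This is the effect the optimized steering vectors are meant to achieve for all $N$ paths at once.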

Theorems & Definitions (19)

  • Definition 1: Parallelepiped
  • Proposition 2: Objective function equivalence
  • Example 3: Non-convexity of $\ell$
  • Proposition 4: Constraint equivalence
  • Lemma 5
  • Theorem 6: Convergence of the RBCD algorithm with exponential maps
  • Definition 7: $L$-smoothness
  • Proposition 8: Smoothness
  • Proposition 9: Block smoothness
  • Proof: Proof of Proposition 2
  • ...and 9 more