Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi

TL;DR

The paper proposes Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance.

Abstract

Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to $+13$ points on AIME25 and $+5$ points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy-execute-pipeline.

Paper Structure

This paper contains 39 sections, 7 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: An example illustrating that strategies which appear valid in isolation may fail when transferred as guidance. In this AIME-level problem, a human-derived structural strategy and a model-derived procedural strategy are each insufficient on their own, while selectively combining them enables successful execution.
  • Figure 2: Performance gains from Selective Strategy Retrieval (SSR) on closed-source reasoning models (GPT-4.1 and o3-mini), measured by pass@1 and averaged over five runs.
  • Figure 3: Normalized strategy usage in human-written and model-generated solutions, aggregated across problems with per-problem normalization. For each problem, strategies contribute equally, ensuring that multi-strategy solutions do not dominate the statistics.
  • Figure 4: Strategy-level divergence between human-written and model-generated solutions. (a) Normalized differences in strategy usage. (b) Normalized differences in strategy-guided accuracy.
  • Figure 5: Multi-route strategy retrieval in Selective Strategy Retrieval (SSR). Complementary retrieval routes capture category-level regularities, problem-specific transfer, and semantic coverage, forming the candidate set $\mathcal{S}(x)$.
  • ...and 5 more figures
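
The per-problem normalization described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes one plausible reading of the caption: each problem carries a total weight of 1, split equally among the distinct strategies its solution uses, so that multi-strategy solutions do not dominate the aggregate. The function name `normalized_usage` and the strategy labels are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def normalized_usage(solutions):
    """Aggregate strategy usage with per-problem normalization.

    `solutions` is a list with one entry per problem; each entry is an
    iterable of strategy labels found in that problem's solution.
    Assumption: every problem has total weight 1, shared equally among
    its distinct strategies.
    """
    totals = defaultdict(float)
    counted = 0  # number of problems with at least one strategy
    for strategies in solutions:
        unique = set(strategies)
        if not unique:
            continue  # skip problems with no annotated strategy
        counted += 1
        share = 1.0 / len(unique)
        for label in unique:
            totals[label] += share
    if counted == 0:
        return {}
    # Divide by the number of counted problems so the values sum to 1.
    return {label: weight / counted for label, weight in totals.items()}


# Hypothetical labels for three problems.
human = [{"invariant", "symmetry"}, {"casework"}, {"invariant"}]
usage = normalized_usage(human)
```

Under this reading, a solution that uses two strategies gives each of them half a unit, while a single-strategy solution gives its strategy a full unit. The resulting values sum to 1 across strategies, which makes human-written and model-generated distributions directly comparable.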