Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

Hengyuan Hu; Tingchen Fu; Minqi Jiang; Alexander H Miller; Yoram Bachrach; Jakob Nicolaus Foerster

TL;DR

We show that good stepping stone questions exist and are transferable: good questions can be generated, and they substantially help LLMs of various capabilities solve the target tasks.

Abstract

Recent years have witnessed tremendous progress in enabling LLMs to solve complex reasoning tasks such as math and coding. As we start to apply LLMs to harder tasks that they may not be able to solve in one shot, it is worth paying attention to their ability to construct intermediate stepping stones that prepare them to better solve the tasks. Examples of stepping stones include simplifications, alternative framings, or subproblems. We study the properties and benefits of stepping stones in the context of modern reasoning LLMs via ARQ (Asking the Right Questions), our simple framework that adds a question generator to the default reasoning pipeline. We first show that good stepping stone questions exist and are transferable: good questions can be generated, and they substantially help LLMs of various capabilities solve the target tasks. We then frame stepping stone generation as a post-training task and show that LLMs can be fine-tuned to generate more useful stepping stones via SFT and RL on synthetic data.

Paper Structure (22 sections, 1 equation, 17 figures, 2 tables)

Introduction
Related Work
Post-training LLMs for Reasoning
Inference-time Scaffolds for Reasoning
Synthetic Task & Data Generation for LLMs
Method
ARQ as an Inference-Time Approach
ARQ as a Post-training Task
Experiments and Results
Performance of ARQ with Off-the-shelf LLMs
Benefits of Good Stepping Stones
Transferability of Good Stepping Stones
Post-training LLMs to Generate Stepping Stones
Extending ARQ to Multiple Stones
Conclusion
...and 7 more sections

Figures (17)

Figure 1: Illustration of ARQ. Instead of directly solving a task that may be too hard for a given solver (top), ARQ (bottom) adds a question-asking step that prompts or trains LLMs to generate stepping stone questions which, once solved, provide guidance or inspiration for the original problem. The LLMs used to generate stepping stone questions are referred to as stepping stone generators. In this example, the stepping stone (generated by an LLM) focuses on a special case of the original problem.
Figure 2: Performance of ARQ and baselines on three math benchmarks using off-the-shelf LLMs. With modern reasoning LLMs, prior prompting-based approaches that include a problem-generation (Analogical) or a decomposition (Least-to-Most) component hurt performance. ARQ outperforms the Solver only baseline only when the stone generator uses a more powerful reasoning model, as in BeyondAIME. The weaker LLM is not able to generate helpful stepping stones in expectation.
Figure 3: The performance of ARQ variants conditioned on the best stepping stone. Existing LLMs can generate highly useful stepping stones that increase the success rate on the target problems, and LLMs with better reasoning capabilities generate better stepping stones. In contrast, the best stone from Rand does not improve over the Solver only baseline (horizontal dashed lines). To avoid selection bias, we select the best stones using half of the rollouts and report the score on the other half.
Figure 4: Scores of individual stones on selected questions from AIME 2024. Each point represents the score of the target problem conditioned on a generated stone, and the short horizontal bar indicates the average score over all stepping stones. The horizontal dashed line is the average performance of the Solver only baseline. ARQ generates stepping stones of varying quality. The best ones are highly beneficial, but some are detrimental and reduce the average improvement. ARQ with the more capable reasoning LLM is able to generate better stones, both in terms of the best and the average.
Figure 5: Transferability results of good stepping stones. Performance remains roughly the same regardless of whether the best stones are selected by the current solver itself or by a reference solver that has a different size and reasoning capacity, indicating that good stepping stones are universally helpful across solvers. The Solver only scores of the new solvers are shown as dashed lines.
...and 12 more figures
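
The Figure 3 caption describes selecting the best stone on one half of the rollouts and reporting its score on the other half. The following is a minimal Python sketch of that split, not the authors' implementation. It assumes each stone comes with a list of 0/1 outcomes from solver rollouts on the target problem; the function name `best_stone_heldout`, the dictionary layout, and the random shuffling are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def best_stone_heldout(rollouts, seed=0):
    """Pick the best stone on one half of its rollouts, score it on the other half.

    rollouts: dict mapping a stone identifier to a list of 0/1 outcomes
              (1 = solver solved the target problem when given that stone).
              This layout is an assumption for illustration only.
    Returns (best_stone_id, held_out_score).
    """
    rng = random.Random(seed)
    select_scores = {}
    report_scores = {}
    for stone, outcomes in rollouts.items():
        if len(outcomes) < 2:
            raise ValueError("each stone needs at least two rollouts to split")
        shuffled = list(outcomes)
        rng.shuffle(shuffled)  # random split so neither half is systematically favored
        half = len(shuffled) // 2
        select_scores[stone] = mean(shuffled[:half])   # used only for selection
        report_scores[stone] = mean(shuffled[half:])   # used only for reporting
    best = max(select_scores, key=select_scores.get)
    return best, report_scores[best]


if __name__ == "__main__":
    # Toy data: stone "special_case" always helps, stone "unrelated" never does.
    toy = {"special_case": [1] * 8, "unrelated": [0] * 8}
    print(best_stone_heldout(toy))
```

Keeping the selection half and the reporting half disjoint means a stone that looks good by chance during selection does not inflate the reported score.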
