AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking in Large Language Models

Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang

TL;DR

AdaReasoner tackles the problem that LLM reasoning performance is highly sensitive to configuration by introducing an RL-based, LLM-agnostic plugin that adaptively selects three hyperparameters per task: instruction format, temperature, and the number of reasoning steps. The method uses a factorized action space with three heads, guided by a pretrained reward model and Boltzmann exploration, and trains with REINFORCE in a few-shot regime, yielding sublinear regret $R(K) \le O(\sqrt{K|\mathcal{A}|\ln|\mathcal{A}|})$ and an $L$-smooth objective with bound $\frac{2\bigl(J(\Theta^*)-J(\Theta_0)\bigr)}{\eta K}+L\eta\sigma^2$. Empirically, AdaReasoner outperforms fixed prompting baselines across six LLMs and diverse tasks, including out-of-distribution and knowledge-intensive settings, and shows rapid few-shot convergence (roughly 50–100 demonstrations suffice). The work highlights practical gains in adaptive reasoning, while acknowledging limitations from discrete action spaces and RL overhead, with future directions toward continuous action spaces and gradient-based prompt optimization.
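The training loop described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the option lists in `INSTRUCTIONS`, `TEMPERATURES`, and `STEP_COUNTS`, the `toy_reward` function (a stand-in for the pretrained reward model), the running-average baseline, and all hyperparameter values are assumptions made for the example. It shows one independent softmax head per hyperparameter, Boltzmann sampling with exploration temperature `tau`, and a REINFORCE update on the sampled configuration.

```python
import math
import random

# Hypothetical discrete options per head (illustrative, not the paper's actual sets).
INSTRUCTIONS = ["step-by-step", "reflect-then-answer", "analogy-first"]
TEMPERATURES = [0.0, 0.5, 1.0]
STEP_COUNTS = [3, 5, 8]
HEADS = [INSTRUCTIONS, TEMPERATURES, STEP_COUNTS]


def softmax(logits, tau=1.0):
    """Boltzmann distribution over logits with exploration temperature tau."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(theta, tau, rng):
    """Sample one option index per head independently (factorized policy)."""
    action = []
    for logits in theta:
        probs = softmax(logits, tau)
        action.append(rng.choices(range(len(logits)), weights=probs)[0])
    return action


def toy_reward(action):
    """Assumed stand-in for the reward model: fraction of heads matching a fixed target."""
    target = [1, 0, 1]
    return sum(1.0 for a, t in zip(action, target) if a == t) / len(target)


def reinforce_step(theta, action, reward, baseline, tau, lr):
    """REINFORCE update: theta += lr * (r - b) * grad log pi(a).

    For a softmax head with temperature tau, the gradient of log pi(a)
    with respect to logit i is (1[i == a] - p_i) / tau.
    """
    for logits, a in zip(theta, action):
        probs = softmax(logits, tau)
        for i in range(len(logits)):
            grad = ((1.0 if i == a else 0.0) - probs[i]) / tau
            logits[i] += lr * (reward - baseline) * grad


def train(num_iters=2000, tau=1.0, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(options) for options in HEADS]  # one logit vector per head
    baseline = 0.0
    for _ in range(num_iters):
        action = sample_action(theta, tau, rng)
        reward = toy_reward(action)
        reinforce_step(theta, action, reward, baseline, tau, lr)
        baseline += 0.05 * (reward - baseline)  # running-average baseline (assumption)
    return theta


theta = train()
best = [max(range(len(logits)), key=logits.__getitem__) for logits in theta]
print([options[i] for options, i in zip(HEADS, best)])
```

Factorizing the policy into three heads keeps the number of logits at the sum of the option counts (9 here) rather than their product (27), which is one plausible reason a few-shot training budget can suffice.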

Abstract

LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and by experiments showing fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yields gains on knowledge-intensive tasks through tailored prompts.

Paper Structure

This paper contains 24 sections, 1 theorem, 19 equations, 14 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Under the smoothness property of the objective function and bounded gradient variance, if stochastic gradient descent (SGD) is run with constant step size $0<\eta\le 1/L$ for $K$ iterations, then the following bound holds for the average squared gradient: $$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\bigl[\|\nabla J(\Theta_k)\|^2\bigr]\le\frac{2\bigl(J(\Theta^*)-J(\Theta_0)\bigr)}{\eta K}+L\eta\sigma^2,$$ where $J(\Theta^*)=\max_\Theta J(\Theta)$.
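As a worked illustration of how the two terms trade off, consider the right-hand side $\frac{2(J(\Theta^*)-J(\Theta_0))}{\eta K}+L\eta\sigma^2$ quoted in the TL;DR. The following is a standard calculation rather than a result stated in the paper; the shorthand $\Delta$ is introduced here for convenience, and $\sigma^2$ is taken to be the gradient-variance bound.

$$
\Delta := J(\Theta^*)-J(\Theta_0), \qquad
f(\eta) = \frac{2\Delta}{\eta K} + L\eta\sigma^2, \qquad
f'(\eta) = -\frac{2\Delta}{\eta^2 K} + L\sigma^2 = 0
\;\Longrightarrow\;
\eta^\star = \sqrt{\frac{2\Delta}{K L \sigma^2}},
$$

$$
f(\eta^\star) = 2\sqrt{\frac{2\Delta L \sigma^2}{K}} = O\!\left(\frac{1}{\sqrt{K}}\right).
$$

The choice $\eta^\star$ respects the step-size condition $\eta\le 1/L$ whenever $K \ge 2\Delta L/\sigma^2$. For smaller $K$, taking $\eta = 1/L$ gives $f(1/L) = \frac{2L\Delta}{K} + \sigma^2$, so the first term shrinks with more iterations while the variance term stays as a floor.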

Figures (14)

  • Figure 1: Performance of different CoT settings on the metaphor dataset [tong2024metaphor]. The default temperature is 0.1 if not specified.
  • Figure 2: The proposed framework of using AdaReasoner to automate the reasoning configuration (instructions, steps, temperature). During training, configuration actions are sampled with Boltzmann exploration and guide the LLM to generate answers, which are then evaluated by a reward model for policy optimization.
  • Figure 3: Few-shot training performance.
  • Figure 4: Performance of different reasoning methods on knowledge-intensive datasets (accuracy in %) with Llama-3.3-70B-Instruct.
  • Figure 5: The distribution of question length per dataset.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Theorem 1: Nonconvex SGD Convergence
  • proof