AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking in Large Language Models
Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang
TL;DR
AdaReasoner addresses the high sensitivity of LLM reasoning performance to configuration. It is an RL-based, LLM-agnostic plugin that adaptively selects three hyperparameters per task: instruction format, temperature, and the number of reasoning steps. The policy uses a factorized action space with three heads, is guided by a pretrained reward model and Boltzmann exploration, and is trained with REINFORCE in a few-shot regime. The analysis gives sublinear regret $R(K) \le O(\sqrt{K|\mathcal{A}|\ln|\mathcal{A}|})$ and, for an $L$-smooth objective, a convergence bound of $\frac{2\bigl(J(\Theta^*)-J(\Theta_0)\bigr)}{\eta K}+L\eta\sigma^2$. Empirically, AdaReasoner outperforms fixed prompting baselines across six LLMs and diverse tasks, including out-of-distribution and knowledge-intensive settings, and converges rapidly in the few-shot setting (roughly 50–100 demonstrations suffice). The work highlights practical gains from adaptive reasoning while acknowledging limitations from the discrete action space and RL overhead, and points to continuous action spaces and gradient-based prompt optimization as future directions.
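As a short worked reading of the two bounds above (an illustration, not a derivation from the paper): write $\Delta = J(\Theta^*)-J(\Theta_0)$ and assume $\Delta > 0$, $\sigma > 0$, and that the step size $\eta$ is free to choose. The right-hand side of the second bound is then minimized by balancing its two terms, and the regret bound implies vanishing average regret:

$$
g(\eta) = \frac{2\Delta}{\eta K} + L\eta\sigma^2,
\qquad
\eta^\star = \sqrt{\frac{2\Delta}{L\sigma^2 K}},
\qquad
g(\eta^\star) = 2\sigma\sqrt{\frac{2L\Delta}{K}} = O\!\left(\frac{1}{\sqrt{K}}\right)
$$

$$
\frac{R(K)}{K} \le O\!\left(\sqrt{\frac{|\mathcal{A}|\ln|\mathcal{A}|}{K}}\right) \longrightarrow 0 \quad \text{as } K \to \infty
$$

Under these assumptions both quantities shrink at rate $1/\sqrt{K}$ in the number of training episodes $K$.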
Abstract
LLMs often need effective configurations, such as temperature and the number of reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work 'well enough' across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin that automates adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework that combines a factorized action space with a targeted exploration strategy and a pretrained reward model, optimizing the policy model for reasoning configurations with only few-shot guidance. AdaReasoner is backed by theoretical guarantees and by experiments demonstrating fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yields gains on knowledge-intensive tasks through tailored prompts.
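To make the training loop more concrete, the following is a minimal sketch of a factorized three-head policy with Boltzmann sampling and a REINFORCE update. It is not the authors' implementation. Several pieces are assumptions made for illustration: the option grids (`INSTRUCTIONS`, `TEMPERATURES`, `NUM_STEPS`), the `toy_reward` function standing in for the pretrained reward model, the fixed exploration temperature `tau`, the running-mean baseline, and the context-free logits (a real policy would presumably condition on the input question).

```python
import math
import random

# Hypothetical option grids, chosen only for illustration.
INSTRUCTIONS = ["step-by-step", "analogy", "self-check"]
TEMPERATURES = [0.0, 0.5, 1.0]   # LLM sampling temperature (an action)
NUM_STEPS = [3, 5, 8]
HEADS = [INSTRUCTIONS, TEMPERATURES, NUM_STEPS]


def boltzmann(logits, tau):
    """Boltzmann (softmax) distribution over logits / tau, numerically stabilised."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(theta, tau, rng):
    """Sample one option index per head; return the indices and per-head probabilities."""
    idxs, probs = [], []
    for logits in theta:
        p = boltzmann(logits, tau)
        idxs.append(rng.choices(range(len(p)), weights=p)[0])
        probs.append(p)
    return idxs, probs


def reinforce_step(theta, idxs, probs, advantage, lr, tau):
    """In-place update: theta += lr * advantage * grad(log pi(action))."""
    for logits, a, p in zip(theta, idxs, probs):
        for j in range(len(logits)):
            # d/d logit_j of log softmax(logits / tau)[a]
            grad = ((1.0 if j == a else 0.0) - p[j]) / tau
            logits[j] += lr * advantage * grad


def toy_reward(idxs):
    """Stand-in for a reward model: fraction of heads matching a hidden target."""
    target = [0, 1, 2]
    return sum(i == t for i, t in zip(idxs, target)) / 3.0


def train(episodes=3000, lr=0.2, tau=1.0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(h) for h in HEADS]  # one logit vector per head
    baseline = 0.0                           # running mean reward, reduces variance
    for _ in range(episodes):
        idxs, probs = sample_action(theta, tau, rng)
        r = toy_reward(idxs)
        reinforce_step(theta, idxs, probs, r - baseline, lr, tau)
        baseline += 0.05 * (r - baseline)
    return theta


theta = train()
# Greedy configuration: the highest-logit option in each head.
best = [max(range(len(l)), key=l.__getitem__) for l in theta]
config = [h[i] for h, i in zip(HEADS, best)]
print(config)
```

Because the joint policy is a product of three independent heads, the log-probability gradient splits into one term per head, so the policy needs 3 + 3 + 3 logits here rather than one logit for each of the 27 joint configurations.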
