DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Murong Yue, Wenlin Yao, Haitao Mi, Dian Yu, Ziyu Yao, Dong Yu

TL;DR

DOTS tackles the rigidity of static reasoning prompts by learning to plan optimal reasoning action trajectories for each question and LLM. It defines atomic action modules across analysis, solution, and verification layers, then uses iterative search to identify trajectories tailored to the solver’s capabilities, followed by supervised fine-tuning of planners (external or internal) to predict these trajectories for new questions. Empirical results across eight tasks show DOTS outperforms static prompting and vanilla SFT, with robust out-of-distribution generalization and data-efficient learning. The approach reveals that LLMs can allocate deeper computation to harder problems and adapt reasoning strategies to both question type and model strengths, offering a flexible, scalable path to improved LLM reasoning in diverse settings.

Abstract

Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called "reasoning actions"), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason dynamically via optimal reasoning trajectory search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.
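The three steps in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The action names, the `solve` callback, the round, sample, and keep counts, and the planner training format are all hypothetical placeholders chosen for illustration. The sketch composes one action per layer into a trajectory, repeatedly evaluates candidates against a solver, keeps the best-scoring ones (preferring shorter trajectories on ties, which is also an assumption), and formats the winner as a supervised training pair for a planner.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Trajectory = Tuple[str, ...]

# Hypothetical atomic actions per layer; "" means "skip this layer".
LAYERS: Dict[str, List[str]] = {
    "analysis": ["", "rewrite", "decompose"],
    "solution": ["cot", "program"],
    "verification": ["", "self_verify"],
}


def all_trajectories() -> List[Trajectory]:
    """Step i: compose one choice per layer, dropping skipped layers."""
    return [tuple(a for a in combo if a) for combo in product(*LAYERS.values())]


def search_best_trajectory(
    question: str,
    solve: Callable[[str, Trajectory], bool],
    trajectories: Sequence[Trajectory],
    rounds: int = 2,
    samples: int = 4,
    keep: int = 3,
) -> Trajectory:
    """Step ii: iteratively explore and evaluate trajectories for one question.

    `solve` stands in for running the task-solving LLM with a trajectory
    and checking its answer; it returns True on a correct answer.
    """
    candidates = list(trajectories)
    outcomes: Dict[Trajectory, List[bool]] = {t: [] for t in candidates}
    for _ in range(rounds):
        for t in candidates:
            outcomes[t].extend(solve(question, t) for _ in range(samples))
        # Rank by success rate (descending), then by length (ascending).
        candidates.sort(
            key=lambda t: (-sum(outcomes[t]) / len(outcomes[t]), len(t))
        )
        candidates = candidates[:keep]
    return candidates[0]


def planner_example(question: str, trajectory: Trajectory) -> Dict[str, str]:
    """Step iii: one supervised pair (question -> trajectory) for a planner."""
    return {"prompt": question, "target": " -> ".join(trajectory)}


if __name__ == "__main__":
    # Toy solver: pretend this question is only solvable with a program.
    toy_solve = lambda q, t: "program" in t
    best = search_best_trajectory("What is 17 * 23?", toy_solve, all_trajectories())
    print(planner_example("What is 17 * 23?", best))
```

In the external-planner setup, pairs like these would be used to tune a separate planner model. In the internalized setup, the target would presumably also include the solver's own reasoning so that a single model learns to plan and solve.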

Paper Structure

This paper contains 41 sections, 6 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: A comparison of different paradigms of LLM reasoning. Unlike prior approaches with predefined, static reasoning actions, DOTS dynamically plans the optimal reasoning trajectory for each question and the specific task-solving LLM ($LLM_s$). In particular, DOTS encompasses two inference setups, external planner tuning (c) and internalized planner tuning (d), depending on whether an external LLM is introduced as a planner ($LLM_p$) or the trajectory planning capability is internalized into the solver LLM ($LLM_s$) itself. (Icons in the figure mark each model as tunable or frozen.)
  • Figure 2: The training process of DOTS, which consists of searching for the optimal reasoning trajectories for questions in the training set and fine-tuning the internalized or external planner LLM.
  • Figure 3: Average reasoning trajectory length per difficulty level on MATH for DOTS (solver: GPT-4o-mini; external planner: Llama3-8B-Instruct).