Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang

TL;DR

This paper addresses multi-step decision-making in LLM agents by training a step-level Q-value model: Monte Carlo Tree Search (MCTS) trajectories are annotated with step-level Q-values, and the model is fit to the resulting preference data via step-level Direct Preference Optimization (DPO). At inference, the agent selects the action with the highest estimated Q-value at each decision step, which propagates credit to earlier choices and mitigates sparse terminal rewards. The approach is evaluated on WebShop and HotPotQA with both open-source and API-based LLMs; the reported gains generalize across backbones and prompting strategies, and the method is described as more data- and compute-efficient than fine-tuning the backbone. Because the Q-value model is plug-and-play, it improves planning and decision-making without altering the underlying LLM and needs only limited additional training data.
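
As a rough illustration of the training signal summarized above, the standard DPO objective can be written at the level of single actions. The notation here is an assumption for illustration, not taken from the paper: s is the decision context (instruction plus interaction history), a^w and a^l are the preferred and dispreferred actions at that step, π_ref is the frozen reference model, and β is a temperature. Reading the scaled log-ratio as a Q-value estimate follows the usual view of DPO-trained models as implicit reward models; whether the paper uses exactly this form is assumed here.

```latex
% Step-level DPO loss: the standard DPO form applied to one action per step
\mathcal{L}(\theta) = -\,\mathbb{E}_{(s,\,a^{w},\,a^{l}) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(a^{w} \mid s)}{\pi_{\mathrm{ref}}(a^{w} \mid s)}
  - \beta \log \frac{\pi_\theta(a^{l} \mid s)}{\pi_{\mathrm{ref}}(a^{l} \mid s)}
\right) \right]

% Implicit step-level value used to rank candidate actions (assumed reading)
Q(s, a) \approx \beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
```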

Abstract

Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO); the resulting LLM serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
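
The inference step described in the abstract (sample several candidate actions, score each with the Q-value model, act on the best one) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `q_value`, `select_action`, the toy log-probability tables, and the value of `beta` are all hypothetical, and the Q-value is assumed to be the scaled log-ratio between the DPO-trained model and its reference model.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: returns log pi(action | state) for some model.
LogProbFn = Callable[[str, str], float]


def q_value(state: str, action: str, policy_logp: LogProbFn,
            ref_logp: LogProbFn, beta: float = 0.1) -> float:
    """Assumed Q estimate: beta * (log pi_theta(a|s) - log pi_ref(a|s))."""
    return beta * (policy_logp(state, action) - ref_logp(state, action))


def select_action(state: str, candidates: Sequence[str],
                  policy_logp: LogProbFn, ref_logp: LogProbFn,
                  beta: float = 0.1) -> str:
    """Pick the candidate action with the highest estimated Q-value."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates,
               key=lambda a: q_value(state, a, policy_logp, ref_logp, beta))


def table_logp(table: Dict[Tuple[str, str], float]) -> LogProbFn:
    """Wrap a lookup table as a log-probability function (toy stand-in for an LLM)."""
    return lambda state, action: table[(state, action)]


# Toy data: illustrative numbers only, not results from the paper.
STATE = "instruction: women's anti-slip shoes under a price limit"
CANDIDATES = ["search[black shoes]", "search[women anti-slip shoes]"]
policy = table_logp({(STATE, CANDIDATES[0]): -2.0, (STATE, CANDIDATES[1]): -1.0})
reference = table_logp({(STATE, CANDIDATES[0]): -1.5, (STATE, CANDIDATES[1]): -3.0})

best = select_action(STATE, CANDIDATES, policy, reference)
```

In this toy setup the second candidate is chosen because the trained model raises its likelihood relative to the reference, while the backbone agent that proposes the candidates is left untouched.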

Paper Structure

This paper contains 37 sections, 10 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of our method. To train the Q-value model, LLM agents interact with the environment to collect preference data with Q-value annotations using MCTS. During inference, LLM agents sample multiple candidate actions and select the best one based on the Q-value model.
  • Figure 2: Cases of GPT-4o-mini agent on WebShop. We analyze the second step of the decision-making process, where the attributes "women," "anti-slip," and "price" should be prioritized over the "black" attribute. The value of these actions is task-relevant and challenging for LLM agents to estimate. An external Q-value model can guide action selection to enhance decision-making. For further details, please refer to the WebShop case appendix.
  • Figure 3: Collecting step-level preference data involves two stages: (a) using MCTS to explore high-quality trajectories and annotate each step with Q-values, and (b) constructing preference data from the final tree. During the construction stage, green nodes represent the best trajectories explored by the agent and are regarded as win nodes at each depth of the tree. Blue nodes are candidates for selecting lose actions, while gray nodes are neglected.
  • Figure 4: Evaluations of learned Q-value models. (a) In addition to the training and IND test datasets, we also evaluate accuracy on an OOD set, where the trajectories are sampled by the Llama-3.1-8B-instruct model. (b) We visualize the Q values of 200 actions sampled by the Phi-3-mini-4k-instruct agent, given the instructions in the test set of WebShop.
  • Figure 5: Ablations of training samples and collection of preference data.
  • ...and 4 more figures
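
The construction stage described in the Figure 3 caption can be sketched roughly as below. This is an illustrative sketch under loud assumptions rather than the paper's procedure: the `Node` class, `build_preference_pairs`, and the `min_gap` threshold are hypothetical; the best trajectory is assumed to be the greedy path by Q-value; and the lose action at each depth is assumed to be the lowest-Q sibling of the win node. Subtrees off the best path are skipped, which loosely corresponds to the neglected gray nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One action in the search tree, annotated with its step-level Q-value."""
    action: str
    q: float
    children: List["Node"] = field(default_factory=list)


# (action history so far, win action, lose action)
Pair = Tuple[Tuple[str, ...], str, str]


def build_preference_pairs(root: Node, min_gap: float = 0.0) -> List[Pair]:
    """Walk the highest-Q path and emit one (win, lose) pair per depth."""
    pairs: List[Pair] = []
    history: List[str] = []
    node = root
    while node.children:
        ranked = sorted(node.children, key=lambda c: c.q, reverse=True)
        win = ranked[0]
        if len(ranked) > 1:
            lose = ranked[-1]  # assumed choice: lowest-Q sibling
            if win.q - lose.q > min_gap:
                pairs.append((tuple(history), win.action, lose.action))
        history.append(win.action)
        node = win  # children of non-win nodes are never visited
    return pairs


# Toy tree with illustrative Q-values only.
tree = Node("root", 0.0, [
    Node("A", 0.9, [Node("A1", 0.8), Node("A2", 0.3)]),
    Node("B", 0.2, [Node("B1", 0.1)]),
])
pairs = build_preference_pairs(tree)
```

Each resulting triple pairs a shared context with a preferred and a dispreferred action, which is the shape of data a step-level preference objective consumes.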