QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang

TL;DR

QLASS tackles sparse, long-horizon rewards in language agents by introducing stepwise Q-value guidance. It builds exploration trees from self-generated trajectories, learns a QNet to predict per-step value, and uses this signal to steer inference with Q-guided generation. The approach reduces reliance on large annotated datasets while delivering strong gains across WebShop, ALFWorld, and SciWorld, including under limited supervision. By enabling more intelligent, stepwise decision making at inference time, QLASS offers a scalable, open-source alternative to heavier training-based self-improvement pipelines.

Abstract

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.
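
The Q-guided generation strategy described above can be illustrated with a minimal sketch. It assumes a greedy variant: at each step the agent proposes several candidate actions and keeps the one with the highest predicted Q-value. The names `q_guided_rollout`, `propose_actions`, `q_value`, and `is_terminal` are hypothetical placeholders for the policy's sampler, the trained QNet, and the environment's termination check; they are not taken from the paper's released code.

```python
from typing import Callable, List, Tuple

# Toy state representation (an assumption): the tuple of actions taken so far.
State = Tuple[str, ...]


def q_guided_rollout(
    initial_state: State,
    propose_actions: Callable[[State], List[str]],  # hypothetical: samples candidates from the agent
    q_value: Callable[[State, str], float],         # hypothetical: stands in for the trained QNet
    is_terminal: Callable[[State], bool],           # hypothetical: environment termination check
    max_steps: int = 10,
) -> State:
    """Greedy stepwise search: at every step, keep the candidate with the highest predicted Q."""
    state = initial_state
    for _ in range(max_steps):
        if is_terminal(state):
            break
        candidates = propose_actions(state)
        if not candidates:
            break
        # Score each candidate action with the Q estimator and take the best one.
        best = max(candidates, key=lambda action: q_value(state, action))
        state = state + (best,)
    return state


if __name__ == "__main__":
    # Toy setup: two candidate actions per step; the stand-in Q function prefers "right".
    trajectory = q_guided_rollout(
        initial_state=(),
        propose_actions=lambda s: ["left", "right"],
        q_value=lambda s, a: 1.0 if a == "right" else 0.0,
        is_terminal=lambda s: len(s) >= 3,
    )
    print(trajectory)
```

In contrast to Best-of-N, which scores only completed trajectories, this kind of selection spends the scoring budget at every intermediate step.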

Paper Structure

This paper contains 29 sections, 10 equations, 9 figures, 5 tables, 3 algorithms.

Figures (9)

  • Figure 1: QLASS pipeline overview. QLASS has four main stages: 1) Supervised fine-tuning (SFT) on expert data. 2) Use the SFT agent to explore the environment and construct an exploration tree for each task, then estimate the Q-value of each tree node with the paper's Q-value update equation. 3) Train QNet on the estimated Q-values. 4) Use the trained QNet to provide inference guidance at each step.
  • Figure 2: Illustrative example of constructing an exploration tree. Grey nodes represent branches with a zero outcome reward. Once a leaf node with a zero outcome reward is detected, a Stop expansion signal is sent back to the first unexpanded node on the branch. Green nodes are on branches where no zero outcome reward has been detected, so they can keep expanding.
  • Figure 3: QLASS and Best-of-N under different search budgets. The x-axis shows the number of tokens consumed by the trajectories generated during inference, averaged over all tasks in each test set.
  • Figure 4: Self-training baselines. The three methods marked with diagonal stripes use different process reward modeling approaches, all based on the same exploration trees constructed in Stage 2, to guide self-training data generation.
  • Figure 5: An example on ALFWorld, with QLASS on the right and the SFT baseline on the left.
  • ...and 4 more figures
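
Stage 2 in the Figure 1 caption (estimating a Q-value for each node of an exploration tree) can be sketched as a recursive backup. The update equation itself is not reproduced on this page, so the sketch assumes a standard Bellman-style rule: a leaf takes its outcome reward, and an internal node takes its own reward plus the discounted maximum over its children. The `Node` class, the `estimate_q` function, and the discount value are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One step in an exploration tree (hypothetical structure for illustration)."""
    action: str
    reward: float = 0.0                      # assumed: nonzero only at leaves (outcome reward)
    children: List["Node"] = field(default_factory=list)
    q: Optional[float] = None                # filled in by estimate_q


def estimate_q(node: Node, gamma: float = 0.9) -> float:
    """Assumed Bellman-style backup: Q = reward + gamma * max over children's Q."""
    if not node.children:
        # Leaf: the Q-value is the outcome reward of the finished trajectory.
        node.q = node.reward
    else:
        # Internal node: back up the best child value, discounted by gamma.
        # max() consumes the whole generator, so every child gets its q assigned.
        node.q = node.reward + gamma * max(estimate_q(c, gamma) for c in node.children)
    return node.q


if __name__ == "__main__":
    # Toy tree: one branch fails (reward 0), the other succeeds two steps later.
    success_leaf = Node("buy item", reward=1.0)
    tree = Node("search", children=[
        Node("click wrong item"),
        Node("click right item", children=[success_leaf]),
    ])
    estimate_q(tree, gamma=0.5)
    print(tree.q, [child.q for child in tree.children])
```

Under these assumptions, the resulting per-node values are the kind of stepwise targets that Stage 3 would use to train QNet.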