TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision

Ruiwen Zhou, Yingxuan Yang, Muning Wen, Ying Wen, Wenhao Wang, Chunling Xi, Guoqiang Xu, Yong Yu, Weinan Zhang

TL;DR

TRAD is a novel framework for large language model agents that surpasses state-of-the-art models while reducing noise and promoting generalization. It has been deployed in real-world scenarios at a global business insurance company, where it improves the success rate of robotic process automation.

Abstract

Numerous large language model (LLM) agents have been built for tasks such as web navigation and online shopping, thanks to LLMs' wide knowledge and text-understanding ability. Many of these works use in-context examples to achieve generalization without fine-tuning, but few have considered how to select and effectively use these examples. Recently, methods that perform trajectory-level retrieval with task meta-data and use the retrieved trajectories as in-context examples have been proposed to improve the agent's overall performance in some sequential decision-making tasks. However, these methods can be problematic for two reasons: examples that merely look plausible are retrieved without regard to task-specific state transition dynamics, and the input becomes long and full of irrelevant context. In this paper, we propose a novel framework (TRAD) to address these issues. TRAD first conducts Thought Retrieval, which achieves step-level demonstration selection via thought matching and leads to more helpful demonstrations and less irrelevant input noise. Then, TRAD introduces Aligned Decision, which complements retrieved demonstration steps with their previous or subsequent steps; this enables tolerance for imperfect thoughts and offers a way to balance more context against less noise. Extensive experiments on the ALFWorld and Mind2Web benchmarks show that TRAD not only outperforms state-of-the-art models but also helps reduce noise and promote generalization. Furthermore, TRAD has been deployed in real-world scenarios at a global business insurance company, where it improves the success rate of robotic process automation.
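
As a rough illustration of step-level demonstration selection via thought matching, the sketch below ranks labeled expert steps by the similarity between their thoughts and the agent's current thought. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Step` record, the `retrieve_steps` function, and the toy bag-of-words `embed` (a stand-in for whatever text encoder is actually used) are all hypothetical names introduced here.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    traj_id: int   # which expert trajectory the step came from
    t: int         # position of the step inside that trajectory
    thought: str   # thought labeled for this step during pre-processing
    action: str    # expert action taken at this step


def embed(text: str) -> Counter:
    # Assumption: a toy bag-of-words vector standing in for a real encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_steps(query_thought: str, memory: List[Step], k: int) -> List[Step]:
    # Rank every labeled expert step by thought similarity and keep the top k.
    q = embed(query_thought)
    ranked = sorted(memory, key=lambda s: cosine(q, embed(s.thought)), reverse=True)
    return ranked[:k]
```

Because retrieval is keyed on thoughts rather than on task meta-data, each retrieved item is a single step, so the prompt only needs to carry the steps that match the current situation instead of whole trajectories.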

Paper Structure (39 sections, 1 equation, 7 figures, 4 tables)

Figures (7)

  • Figure 1: An overall illustration of the TRAD agent (on the ALFWorld environment; Shridhar et al., 2021). TRAD first pre-processes expert trajectories, labeling each step with high-quality thoughts. At inference time, TRAD first conducts thought retrieval: it generates a thought using trajectory-wise retrieved demonstrations, then uses that thought as the query, with the labeled step thoughts as keys, for a more precise step-wise demonstration retrieval. Given the retrieved steps, TRAD employs the aligned decision module to complement them with their temporally neighboring steps and the corresponding position information (Fig. 2). Finally, the next action is generated according to the enhanced demonstration.
  • Figure 2: An illustration of our aligned decision method, where $B=F=1$ and the $i$-th retrieved step is at time $t^i$ in its trajectory. The method applies three sub-processes to the retrieved step demonstrations and the prompt: 1) Temporal Expansion: collect at most $B$ previous steps and $F$ subsequent steps for each retrieved step, transforming each step into a sequence of length $B+F+1$ from $t^i-B$ to $t^i+F$; 2) Relative Order Mark: for each step in a demonstration step sequence, label its position relative to the retrieved step in that sequence, i.e., the previous one ($t^i-1$) with [Step -1] and the next one ($t^i+1$) with [Step 1]; 3) History Alignment: for the current episode, complement the current observation (and, optionally, the thought) with $B+F$ previous steps to enrich information and align with the demonstrations.
  • Figure 3: The effect of varying the number of subsequent steps $F$ and previous steps $B$ on the Mind2Web benchmark. Solid lines correspond to the performance metrics of TRAD for different $F$ and $B$, and dashed lines correspond to the Synapse baseline. Forward expansion ($F>0$) generally provides more improvement than backward expansion ($B>0$) over no expansion ($F=B=0$) and the Synapse baseline. Increasing $F$ or $B$ brings no further gains once they are sufficiently large.
  • Figure 4: The effect of varying the number of retrieved demonstrations $K$ on the Mind2Web benchmark. Solid lines correspond to the performance metrics of TRAD for different $K$, and dashed lines correspond to the Synapse baseline. $K$ has a mild effect on the performance of TRAD and Synapse, and the advantage of TRAD over Synapse remains stable as $K$ varies.
  • Figure 5: Comparison between Synapse's trajectory-wise retrieval with task meta-data and TRAD's step-wise retrieval with thoughts. (a) The trajectory-wise retrieval of Synapse only considers "search" in the task instructions, and the retrieved trajectories are completely irrelevant. However, by generating thoughts with these irrelevant trajectories, thought retrieval finds more relevant step-wise demonstrations related to baby (toddler) and navigation. (b) Using task meta-data, the trajectory-wise retrieval of Synapse retrieves plausible-looking examples that do not type into a text box. Although the thoughts are imperfect, thought retrieval finds more relevant demonstrations, and TRAD learns to input "New York".
  • ...and 2 more figures
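
The three sub-processes described in the Figure 2 caption can be sketched roughly as follows. This is a minimal sketch under assumptions, not the paper's code: the function names `expand_with_marks` and `align_history` are hypothetical, steps are represented as plain strings, the retrieved step itself is marked [Step 0] (the caption only names [Step -1] and [Step 1]), and windows are simply truncated at trajectory boundaries.

```python
from typing import List, Tuple


def expand_with_marks(trajectory: List[str], t: int, B: int, F: int) -> List[Tuple[str, str]]:
    """Temporal expansion plus relative order marks around retrieved step t."""
    start = max(0, t - B)                     # at most B previous steps
    end = min(len(trajectory) - 1, t + F)     # at most F subsequent steps
    # Mark each step with its offset from the retrieved step, e.g. [Step -1].
    return [(f"[Step {i - t}]", trajectory[i]) for i in range(start, end + 1)]


def align_history(history: List[str], B: int, F: int) -> List[str]:
    """History alignment: the current step plus up to B+F previous steps.

    Assumes the last element of `history` is the current observation.
    """
    return history[-(B + F + 1):]
```

With $B=F=1$, a retrieved step in the middle of a trajectory expands to three marked steps, and the current episode contributes its latest observation together with the two steps before it, so the demonstration and the query have matching lengths.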