EPO: Hierarchical LLM Agents with Environment Preference Optimization

Qi Zhao, Haotian Fu, Chen Sun, George Konidaris

TL;DR

This paper proposes a hierarchical framework that decomposes complex tasks into manageable subgoals, using separate LLMs for subgoal prediction and low-level action generation. It also introduces Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment's feedback and uses them to train LLM-based agents.

Abstract

Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment's feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
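
The two-level decomposition described in the abstract can be sketched as a nested loop: a high-level policy proposes subgoals, and a low-level policy emits actions for each subgoal until it signals completion. This is a minimal sketch, not the authors' implementation. The names `run_episode`, `toy_high_level`, `toy_low_level`, the `"<done>"` stop token, and the action strings are all hypothetical stand-ins for LLM calls and a real environment, and the environment step that would update the observation is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: both policies stand in for separate LLM calls.
SubgoalPolicy = Callable[[str, str], List[str]]      # (instruction, observation) -> subgoals
ActionPolicy = Callable[[str, str, List[str]], str]  # (subgoal, observation, history) -> action


def run_episode(
    instruction: str,
    observation: str,
    high_level: SubgoalPolicy,
    low_level: ActionPolicy,
    max_steps_per_subgoal: int = 5,
) -> List[Tuple[str, List[str]]]:
    """Decompose the instruction into subgoals, then act on each subgoal in turn."""
    trajectory = []
    for subgoal in high_level(instruction, observation):
        history: List[str] = []
        for _ in range(max_steps_per_subgoal):
            # The low-level policy sees its own past actions (autoregressive use).
            action = low_level(subgoal, observation, history)
            if action == "<done>":  # assumed stop token, not from the paper
                break
            history.append(action)
            # A real agent would step the environment here and refresh `observation`.
        trajectory.append((subgoal, history))
    return trajectory


# Toy stand-ins so the sketch runs without any model or simulator.
def toy_high_level(instruction: str, observation: str) -> List[str]:
    return ["find mug", "pick up mug"]


def toy_low_level(subgoal: str, observation: str, history: List[str]) -> str:
    script = {
        "find mug": ["MoveAhead", "RotateRight"],
        "pick up mug": ["PickupObject"],
    }
    steps = script[subgoal]
    return steps[len(history)] if len(history) < len(steps) else "<done>"


trajectory = run_episode("bring me a mug", "kitchen view", toy_high_level, toy_low_level)
print(trajectory)
```

Keeping the two policies behind separate callables mirrors the paper's choice of separate LLMs for the two levels, so each can be trained or replaced independently.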

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An illustration of the hierarchical framework. Our agent first outputs the subgoals from human instructions and visual inputs using its high-level subgoal decomposition module. Then the interaction module predicts low-level actions autoregressively to complete the given subgoals.
  • Figure 2: An illustration of our pipeline for training the reward model that grounds environment feedback in human instructions. We train the reward model with supervision on the annotated data, then use it to label unannotated data and obtain preference relations. Finally, we form the EPO datasets and optimize our agent policies with the proposed EPO algorithm.
  • Figure 3: A visual illustration of how EPO improves both the high-level subgoal decomposition policy and the low-level interaction policy. The left figure shows the difference between a baseline high-level policy and its EPO-trained counterpart: the latter correctly identifies the subgoal. The right figure shows the difference between a baseline low-level policy and its EPO-trained counterpart: the latter can make follow-up adjustments to execute the actions successfully.
  • Figure 4: An illustration of the prompts given to our LLM policies. From top to bottom: an example of the baseline subgoal policy, the baseline interaction policy, interaction feedback, visual feedback, reward model training data, and environment preference data.
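
The labeling step in the Figure 2 caption (score unannotated outputs with a reward model, then derive preference relations) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's procedure: `build_preference_pairs`, the `margin` parameter, the `prompt`/`chosen`/`rejected` record layout, and the word-overlap `toy_reward` are all hypothetical. In the paper the reward model is a trained model that uses multimodal environment feedback; here a toy scoring function takes its place.

```python
from itertools import combinations
from typing import Callable, Dict, List


def build_preference_pairs(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],
    margin: float = 0.0,
) -> List[Dict[str, str]]:
    """Turn reward scores over candidate outputs into (chosen, rejected) pairs."""
    scored = [(c, reward_fn(prompt, c)) for c in candidates]
    pairs = []
    for (a, reward_a), (b, reward_b) in combinations(scored, 2):
        # Skip pairs whose rewards are too close to express a clear preference.
        if abs(reward_a - reward_b) <= margin:
            continue
        chosen, rejected = (a, b) if reward_a > reward_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_reward(prompt: str, candidate: str) -> float:
    # Hypothetical stand-in for a learned reward model: counts shared words.
    return float(len(set(prompt.split()) & set(candidate.split())))


pairs = build_preference_pairs(
    "pick up the mug",
    ["pick up mug", "open fridge", "pick mug"],
    toy_reward,
)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Records of this shape are what a preference-based objective typically consumes; the specific EPO loss is defined in the paper and is not reproduced here.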