Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen

TL;DR

The paper tackles the granularity mismatch in offline alignment of long-horizon LLM agents by introducing Hierarchical Preference Learning (HPL), which integrates trajectory-, step-, and group-level Direct Preference Optimization (DPO) losses alongside a dual-layer curriculum. By bootstrapping from expert behavior, generating multi-granularity preference data, and using semantically meaningful action groups, HPL achieves stronger performance on ALFWorld, WebShop, and InterCode-SQL than prior baselines. The curriculum guides learning from simple sub-tasks to complex, multi-step tasks, and an ablation study highlights the critical role of group-level DPO in driving gains. Overall, HPL advances offline alignment by leveraging hierarchical signals that better credit intermediate sub-tasks and long-horizon planning, with practical implications for scalable, safer autonomous LLM agents.

Abstract

Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.
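
To make the three granularities concrete, the following is a minimal sketch of how a composite DPO objective could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names (`span_logp`, `dpo_loss`, `hierarchical_loss`), the per-level weights, and the choice to average pairs within each level are all hypothetical. The only thing that changes between levels is the span of tokens whose log-probabilities are summed: a whole trajectory, a single step, or an action group.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def span_logp(token_logps, start, end):
    """Sum token log-probs over [start, end): the span defines the granularity."""
    return sum(token_logps[start:end])


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are summed log-probs under the policy, ref_logp_* under the
    frozen reference model. beta is the usual DPO temperature.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def hierarchical_loss(pairs_by_level, weights, beta=0.1):
    """Weighted sum of mean DPO losses over granularity levels.

    pairs_by_level maps a level name ("traj", "step", "group") to a list of
    (logp_w, logp_l, ref_logp_w, ref_logp_l) tuples. The weights are
    hypothetical hyperparameters; a missing weight defaults to 1.0.
    """
    total = 0.0
    for level, pairs in pairs_by_level.items():
        if not pairs:
            continue
        mean = sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)
        total += weights.get(level, 1.0) * mean
    return total
```

In this sketch a pair on which the policy and the reference agree contributes exactly log 2, and the loss falls below that value once the policy favors the preferred span more strongly than the reference does.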

Paper Structure

This paper contains 41 sections, 2 theorems, 32 equations, 5 figures, 6 tables.

Key Result

Proposition 1

Let $T$ denote the trajectory length, $\gamma\in[0,1)$ the discount factor, and $R_\text{max}$ the maximum reward. Let $\mathcal{L}_\text{traj}$, $\mathcal{L}_\text{step}$, and $\mathcal{L}_\text{group}(k)$ denote the empirical losses of trajectory-level, step-level, and group-level DPO with group l

Figures (5)

  • Figure 1: Conceptual comparison of different DPO granularities. While (a) trajectory-level DPO provides a coarse signal and (b) step-level DPO provides a myopic one, (c) our proposed Group-level DPO learns from semantically coherent action groups, which provides a structured and meaningful signal, enabling the agent to reason at the sub-task level.
  • Figure 2: An overview of our proposed framework, HPL. Stage 1 generates hierarchical preference data with the Action Group Segmentation component. Stage 2 then optimizes the agent with a composite objective, where the training is guided by a dual-layer curriculum scheduler.
  • Figure 3: Illustration of the dual-layer curriculum scheduler with group length ($L$) and sample difficulty ($\Delta R$). The training follows a three-phase schedule.
  • Figure 4: Phase-wise performance progression of HPL on the ALFWorld benchmark. (a) Success rates for both 1.5B and 7B models across the three curriculum phases. (b) A detailed breakdown for the 1.5B model on 6 sub-task types.
  • Figure 5: Ablation study on the HPL loss components on Qwen2.5-7B-Instruct.
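
The dual-layer curriculum described in the abstract and in Figure 3 can be sketched as a filter over group-level preference pairs. This is a hedged illustration only: the cut points, the three-bucket split on each axis, and the rule for which buckets each phase admits are assumptions made for the example, and the names `GroupPair`, `bucket`, and `phase_samples` are hypothetical. The sketch treats a large reward gap as an easy sample, since the preferred and dispreferred groups are then easier to tell apart.

```python
from dataclasses import dataclass


@dataclass
class GroupPair:
    length: int        # number of actions in the group (sub-task complexity)
    reward_gap: float  # reward(preferred group) - reward(dispreferred group)


def bucket(value, cuts):
    """Return 0, 1, or 2 depending on where value falls relative to two cut points."""
    low, high = cuts
    if value <= low:
        return 0
    return 1 if value <= high else 2


def phase_samples(pairs, phase, length_cuts=(3, 6), gap_cuts=(0.3, 0.6)):
    """Select the pairs admitted in curriculum phase 1, 2, or 3.

    Assumed rule: phase p admits a pair only if both its length level and
    its difficulty level are at most p - 1, so phase 3 admits everything.
    """
    selected = []
    for pair in pairs:
        length_level = bucket(pair.length, length_cuts)            # 0 = short
        # A large reward gap means an easy sample, so invert the bucket index.
        difficulty_level = 2 - bucket(pair.reward_gap, gap_cuts)   # 0 = easy
        if max(length_level, difficulty_level) <= phase - 1:
            selected.append(pair)
    return selected
```

Under these assumed cut points, training would start on short groups with a large reward gap and progressively admit longer groups and harder (smaller-gap) pairs until the full dataset is in use.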

Theorems & Definitions (3)

  • Proposition 1: Bias-variance trade-off of group-level DPO loss
  • Proposition 1: Bias-variance trade-off of group-level DPO loss
  • proof