Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar

TL;DR

This work formalizes Hierarchical Reward Design (HRD) to enable richer, language-based specification of long-horizon agent behavior by decomposing rewards into low- and high-level components. It introduces HRDL, an instantiation that uses natural language to define these hierarchical rewards, and proposes Language to Hierarchical Rewards (L2HR), a prompting-and-training procedure that generates executable, hierarchical reward programs from language. Empirical results across Rescue World, iTHOR, and Kitchen show that HRD-based rewards improve high-level alignment and task feasibility compared with flat rewards, with human evaluations corroborating stronger perceived alignment. The approach advances human-aligned AI by providing a scalable, language-driven method to tailor hierarchical RL to complex behavioral specifications, while outlining future work on real-world deployment and richer reward-generation techniques.

Abstract

When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.

Paper Structure

This paper contains 51 sections, 5 theorems, 7 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Let $\mathcal{M}_p = (\mathcal{S}, \mathcal{A}, T, \gamma)$ be a world model, $\mathcal{O}$ a set of options, and $r_L: \mathcal{S} \times \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ a low‑level reward. For a fixed option $o \in \mathcal{O}$, the tuple $\mathcal{M}_{L, o} = (\mathcal{S},
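
The proposition statement is cut off above, so the following is only a minimal sketch of the one idea that is visible: a low-level reward with signature $r_L: \mathcal{S} \times \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ can be specialized to a fixed option $o$, giving an ordinary state-action reward. The sketch assumes this specialization is $r_{L,o}(s, a) = r_L(s, o, a)$, which the truncated text does not confirm. The option names, action names, and reward values are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, Hashable

State = Hashable
Option = str
Action = str

# Signature taken from the proposition: r_L maps (state, option, action) to a real number.
LowLevelReward = Callable[[State, Option, Action], float]
StateActionReward = Callable[[State, Action], float]


def fix_option(r_low: LowLevelReward, option: Option) -> StateActionReward:
    """Specialize r_L to one option.

    ASSUMPTION: r_{L,o}(s, a) = r_L(s, o, a). The source text is truncated
    before it states how the per-option model is built.
    """
    def r_fixed(state: State, action: Action) -> float:
        return r_low(state, option, action)

    return r_fixed


def toy_low_level_reward(state: State, option: Option, action: Action) -> float:
    """Hypothetical toy reward; names and values are illustrative only."""
    if option == "go_to_victim":
        return 1.0 if action == "move" else 0.0
    if option == "rescue":
        return 1.0 if action == "pick_up" else -0.1
    return 0.0


# One state-action reward per option, each usable by a standard RL algorithm.
per_option_reward: Dict[Option, StateActionReward] = {
    o: fix_option(toy_low_level_reward, o) for o in ("go_to_victim", "rescue")
}
```

Under this assumption, each entry of `per_option_reward` could serve as the reward of a separate per-option learning problem over the same states and actions, which appears to be the role of $\mathcal{M}_{L, o}$ in the proposition.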

Figures (8)

  • Figure 1: This work introduces the Hierarchical Reward Design from Language (HRDL) problem. Unlike prior work on reward design, HRDL decomposes reward design into low- and high-level components $(\tilde{r}_L, \tilde{r}_H)$. Language to Hierarchical Rewards (L2HR), our proposed solution to HRDL, leverages language models to guide the synthesis of these hierarchical rewards, enabling existing RL algorithms to train agents that are better aligned with human specifications.
  • Figure 2: Although prior works have utilized multi-level rewards to train hierarchical agents, the design of such rewards remains underexplored and lacks a concrete problem formulation. This work bridges this gap by introducing HRDL.
  • Figure 3: Illustration of L2HR input and output.
  • Figure 4: Rendered scenes from the iTHOR and Kitchen domains. Fig. 1 (right) illustrates the Rescue World domain.
  • Figure 5: Syntax error rates for LLM-based reward generation, computed over 24 candidates per configuration.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 1: Low-level Reward
  • Proposition 1: Low-level MDP Models
  • Definition 2: High-level Reward
  • Proposition 2: High-level SMDP Model
  • Definition 3: Hierarchical Reward Design (HRD)
  • Definition 4: Hierarchical Reward Design from Language (HRDL)
  • Proposition 1: Low-level MDP Models
  • Proposition 2: High-level SMDP Model
  • Proposition 3: High-level MDP Model