Leveraging Human Guidance for Deep Reinforcement Learning Tasks

Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, Peter Stone

TL;DR

The paper surveys human-guided deep reinforcement learning approaches that go beyond traditional demonstrations, focusing on evaluative feedback, human preferences, hierarchical guidance, imitation from observation, and attention-based signals. It analyzes how each framework defines signals, assumptions, and implementations, illustrating improvements in sample efficiency and performance on challenging tasks. Key contributions include mapping diverse feedback modalities to learning objectives, highlighting practical methods like TAMER, COACH, preference-based RL, and IfO, and outlining future directions such as data sharing, understanding trainers, and a unified lifelong learning paradigm. The work emphasizes combining multiple human guidance signals to create more robust, scalable learning systems for complex environments.

Abstract

Reinforcement learning agents can learn to solve sequential decision tasks by interacting with the environment. Human knowledge of how to solve these tasks can be incorporated using imitation learning, where the agent learns to imitate human-demonstrated decisions. However, human guidance is not limited to demonstrations. Other types of guidance could be more suitable for certain tasks and require less human effort. This survey provides a high-level overview of five recent learning frameworks that primarily rely on human guidance other than conventional, step-by-step action demonstrations. We review the motivation, assumptions, and implementation of each framework. We then discuss possible future research directions.

Paper Structure

This paper contains 15 sections, 3 figures.

Figures (3)

  • Figure 1: Human-agent-environment interaction diagrams of different approaches discussed in this paper. These diagrams illustrate how different types of human guidance data are collected, including the information required by the human trainer and the guidance provided to the agent. Note that the learning process of the agent is not included in these diagrams. A: learning agent; E: environment; Arrow: information flow direction; Dashed arrow: optional information flow. In (a) standard imitation learning, the human trainer observes state information $s_t$ and demonstrates action $a^*_t$ to the agent; the agent stores this data for use in later learning. In (b) learning from evaluative feedback, the human trainer watches the agent perform the task and provides instant feedback $H_t$ on the agent's decision $a_t$ in state $s_t$. (c) Imitation from observation is similar to standard imitation learning, except that the agent does not have access to the human-demonstrated actions. (d) Learning attention from humans requires the trainer to provide an attention map $w_t$ to the learning agent.
  • Figure 2: Learning from human preference. The human trainer watches two behaviors generated by the learning agent simultaneously and decides which behavior is preferable. $\tau^1 \succ \tau^2$ denotes that the trainer prefers behavior trajectory $\tau^1$ over $\tau^2$.
  • Figure 3: Hierarchical imitation. HA: high-level agent; LA: low-level agent. The high-level agent chooses a high-level goal $g_t$ for state $s_t$. The low-level agent then chooses an action $a_t$ based on $g_t$ and $s_t$. The primary guidance that the trainer provides in this framework is the correct high-level goal $g^*_t$.
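
The guidance signals described in the three captions can be summarized as simple per-step records. The following is only an illustrative sketch, not code from the paper: all class and function names are hypothetical, the scalar type chosen for $H_t$ and the list types chosen for $w_t$ and $\tau$ are assumptions, and the toy policies exist only to show the information flow of Figure 3.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

State = Any   # s_t; the concrete representation is task-specific (assumption)
Action = Any  # a_t
Goal = Any    # g_t


@dataclass
class Demonstration:            # Figure 1(a): standard imitation learning
    state: State                # s_t observed by the trainer
    action: Action              # a*_t demonstrated by the trainer


@dataclass
class EvaluativeFeedback:       # Figure 1(b)
    state: State                # s_t
    action: Action              # a_t chosen by the agent
    feedback: float             # H_t; a scalar is assumed here


@dataclass
class ObservationOnly:          # Figure 1(c): no demonstrated action available
    state: State


@dataclass
class AttentionSample:          # Figure 1(d)
    state: State
    attention_map: Sequence[float]  # w_t; a flat weight vector is assumed


@dataclass
class Preference:               # Figure 2
    trajectory_1: Sequence[Any]     # tau^1
    trajectory_2: Sequence[Any]     # tau^2
    first_preferred: bool           # True when tau^1 is preferred over tau^2


@dataclass
class GoalLabel:                # Figure 3: the trainer's primary guidance
    state: State
    goal: Goal                  # g*_t, the correct high-level goal


def hierarchical_step(
    high_level: Callable[[State], Goal],
    low_level: Callable[[State, Goal], Action],
    state: State,
) -> Tuple[Goal, Action]:
    """One decision step of Figure 3: HA picks g_t, then LA picks a_t."""
    goal = high_level(state)
    action = low_level(state, goal)
    return goal, action


# Toy usage with hypothetical policies on an integer state.
toy_high = lambda s: "reach_door" if s < 5 else "open_door"
toy_low = lambda s, g: 1 if g == "reach_door" else 0
goal, action = hierarchical_step(toy_high, toy_low, 3)
label = GoalLabel(state=3, goal="reach_door")  # what a trainer might supply
```

Keeping one record type per framework makes explicit what each one omits: `ObservationOnly` carries no action, and `Preference` carries no per-step label at all.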