DQN-TAMER: Human-in-the-Loop Reinforcement Learning with Intractable Feedback

Riku Arakawa, Sosuke Kobayashi, Yuya Unno, Yuta Tsuboi, Shin-ichi Maeda

TL;DR

The paper tackles the exploration challenge in reinforcement learning for robotics by introducing a human-in-the-loop framework, DQN-TAMER, that combines immediate human feedback with distant environmental rewards. It models five properties of realistic human feedback (binary, delay, stochasticity, unsustainability, and natural reaction) and reports that DQN-TAMER outperforms DQN and Deep TAMER on Maze and Taxi tasks. A GoPiGo3 car demonstration shows that the system can use facial-expression feedback despite classifier errors. Overall, the results suggest the approach is a practical step toward real-world human-in-the-loop RL on robots.

Abstract

Exploration has been one of the greatest challenges in reinforcement learning (RL), which is a large obstacle in the application of RL to robotics. Even with state-of-the-art RL algorithms, building a well-learned agent often requires too many trials, mainly due to the difficulty of matching its actions with rewards in the distant future. A remedy for this is to train an agent with real-time feedback from a human observer who immediately gives rewards for some actions. This study tackles a series of challenges for introducing such a human-in-the-loop RL scheme. The first contribution of this work is our experiments with a precisely modeled human observer: binary, delay, stochasticity, unsustainability, and natural reaction. We also propose an RL method called DQN-TAMER, which efficiently uses both human feedback and distant rewards. We find that DQN-TAMER agents outperform their baselines in Maze and Taxi simulated environments. Furthermore, we demonstrate a real-world human-in-the-loop RL application where a camera automatically recognizes a user's facial expressions as feedback to the agent while the agent explores a maze.
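
To make the modeled observer more concrete, here is a minimal Python sketch of a simulated human who gives binary, delayed, stochastic, and unsustainable feedback. It is an illustration under stated assumptions, not the authors' implementation: the class name `SimulatedObserver`, its parameters, and all default values are hypothetical, stochasticity is read here as feedback arriving only with some probability, and the natural-reaction factor (feedback inferred from facial expressions) is not modeled.

```python
import random
from collections import deque


class SimulatedObserver:
    """Hypothetical simulated observer; names and defaults are assumptions."""

    def __init__(self, p_feedback=0.5, delay=2, stop_after_episode=30, seed=0):
        self.p_feedback = p_feedback                  # chance of reacting at all
        self.delay = delay                            # steps before feedback arrives
        self.stop_after_episode = stop_after_episode  # observer quits at this episode
        self.rng = random.Random(seed)
        self.pending = deque()                        # judgements not yet delivered

    def observe(self, action_was_good, episode):
        """Record a judgement for this step and return the feedback due now.

        Returns +1 or -1 when feedback arrives, and 0 when there is none.
        """
        # Binary: the observer only says "good" (+1) or "bad" (-1).
        signal = 1 if action_was_good else -1
        if episode >= self.stop_after_episode:
            # Unsustainability: the observer has stopped giving feedback.
            signal = 0
        elif self.rng.random() >= self.p_feedback:
            # Stochasticity: the observer stays silent on this step.
            signal = 0
        self.pending.append(signal)
        # Delay: a judgement is delivered `delay` steps after it was made.
        if len(self.pending) > self.delay:
            return self.pending.popleft()
        return 0
```

With `p_feedback=1.0` and `delay=2`, the judgement for the first step is returned on the third call to `observe`, so a learner has to credit feedback to an earlier state-action pair rather than the current one.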

Paper Structure

This paper contains 21 sections, 8 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: Overview of human-in-the-loop RL and our model (DQN-TAMER). The agent asynchronously interacts with a human observer in the given environment. DQN-TAMER decides actions based on two models. One (Q) estimates rewards from the environment and the other (H) for feedback from the human.
  • Figure 2: Maze: an environment with walls (black squares), the agent, and the goal.
  • Figure 3: Taxi: an environment with walls ($|$; bold bars), the taxi agent, the passenger (at G), and the goal (Y).
  • Figure 4: Maze results (upper: high frequency, lower: low frequency, left: without delay, and right: with delay).
  • Figure 5: Maze with feedback stop. Feedback ends after 30 episodes. Left: MDP, right: POMDP.
  • ...and 2 more figures
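
The Figure 1 caption says DQN-TAMER decides actions based on two models, Q for environment rewards and H for human feedback. One simple way to combine two such estimates is a weighted sum followed by a greedy choice, sketched below. The function name `select_action` and the weights `alpha_q` and `alpha_h` are assumptions introduced for illustration; this listing does not state the paper's exact combination rule.

```python
def select_action(q_values, h_values, alpha_q=1.0, alpha_h=1.0):
    """Pick the action with the highest weighted sum of Q and H estimates.

    q_values: per-action estimates of environment reward (model Q).
    h_values: per-action estimates of human feedback (model H).
    alpha_q, alpha_h: hypothetical mixing weights (assumed, not from the source).
    """
    if len(q_values) != len(h_values):
        raise ValueError("Q and H must score the same set of actions")
    scores = [alpha_q * q + alpha_h * h for q, h in zip(q_values, h_values)]
    # Greedy choice; ties resolve to the lowest action index.
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: H vetoes the action that Q alone would prefer.
q = [0.1, 0.5, 0.2]
h = [0.0, -1.0, 0.3]
best_combined = select_action(q, h)             # scores: 0.1, -0.5, 0.5
best_q_only = select_action(q, h, alpha_h=0.0)  # ignores H entirely
```

Setting `alpha_h` to zero reduces the rule to a plain greedy choice over Q, which is one way to handle an observer who stops giving feedback partway through training.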