CANDERE-COACH: Reinforcement Learning from Noisy Feedback

Yuxuan Li, Srijita Das, Matthew E. Taylor

TL;DR

CANDERE-COACH is an algorithm that learns from noisy feedback provided by a nonoptimal teacher. Its noise-filtering mechanism de-noises online feedback data, enabling the RL agent to learn successfully even when up to 40% of the teacher feedback is incorrect.

Abstract

Reinforcement learning (RL) has recently been applied to many challenging tasks. However, to perform well, it requires a good reward function, which is often sparse or manually engineered and therefore error-prone. Introducing human prior knowledge, for example through imitation learning, learning from preferences, or inverse reinforcement learning, is often seen as a possible solution to this problem. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher's (positive or negative) evaluation of the agent's action. However, these methods often assume that evaluative teacher feedback is perfect, which is restrictive. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors such as cognitive load, availability, and distraction. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback by a nonoptimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.
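To make the noise setting concrete, the sketch below simulates a teacher whose binary feedback is flipped with a fixed probability. It is only an illustration of what "40% of the teacher feedback being incorrect" could mean. The function name `noisy_feedback`, the +1/-1 encoding, and the assumption of independent, uniformly random flips are hypothetical and not taken from the paper.

```python
import random


def noisy_feedback(true_label: int, noise_rate: float, rng: random.Random) -> int:
    """Return binary feedback (+1 or -1), flipped with probability `noise_rate`.

    Assumption: flips are independent and uniformly random, which may differ
    from the noise model actually used in the paper.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    return -true_label if rng.random() < noise_rate else true_label


# Simulate a teacher that is wrong about 40% of the time.
rng = random.Random(0)
labels = [1 if i % 2 == 0 else -1 for i in range(10_000)]
noisy = [noisy_feedback(y, 0.4, rng) for y in labels]

# Fraction of feedback signals that disagree with the correct evaluation.
flipped = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
print(f"empirical flip rate: {flipped:.3f}")
```

With a flip rate this close to 50%, the raw feedback carries only a weak signal about the correct evaluation, which is why a filtering step becomes important.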

Paper Structure

This paper contains 29 sections, 5 equations, 19 figures, 8 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Overview of CANDERE-COACH. We use a classifier $C_\phi$ to filter noisy feedback and update the policy $\pi_\theta$ and $C_\phi$ with filtered minibatches.
  • Figure 2: (a) Cart Pole, (b) Minigrid Door Key, and (c) Lunar Lander are used for evaluation.
  • Figure 3: Performance of Deep COACH under different noise levels in Cart Pole. With an unlimited budget, Deep COACH can still learn, albeit slowly, even under 40% noise; with a limited budget, its performance deteriorates significantly.
  • Figure 4: Performance of CANDERE-COACH in Cart Pole, Door Key, and Lunar Lander under 30% and 40% noise.
  • Figure 5: Performance of CANDERE-COACH in Cart Pole with a noisy pretraining dataset.
  • ...and 14 more figures
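
The filtering step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's method: the excerpt does not give the actual filtering rule, so the agreement threshold, the 0/1 label encoding, the function name `filter_minibatch`, and the toy logistic classifier standing in for $C_\phi$ are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

# (features of a state-action pair, teacher feedback in {0, 1})
Sample = Tuple[List[float], int]


def filter_minibatch(
    batch: List[Sample],
    predict_proba: Callable[[List[float]], float],
    threshold: float = 0.5,
) -> List[Sample]:
    """Keep samples whose feedback label the classifier considers plausible.

    Assumption: a sample is kept when the classifier assigns at least
    `threshold` probability to the teacher's label. The paper may use a
    different criterion.
    """
    kept = []
    for features, label in batch:
        p_positive = predict_proba(features)
        p_label = p_positive if label == 1 else 1.0 - p_positive
        if p_label >= threshold:
            kept.append((features, label))
    return kept


def toy_classifier(features: List[float]) -> float:
    """Hypothetical stand-in for C_phi: a logistic function of the first feature."""
    return 1.0 / (1.0 + math.exp(-features[0]))


batch = [([1.0], 1), ([2.0], 0), ([-1.0], 0), ([-3.0], 1)]
filtered = filter_minibatch(batch, toy_classifier)
print(filtered)  # the two samples whose labels agree with the classifier
```

In a full training loop, the filtered minibatch would then be used to update both the policy and the classifier, as the caption indicates. How the classifier is initialised and kept reliable under heavy noise is not covered by this sketch.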