On the Theory of Reinforcement Learning with Once-per-Episode Feedback

Niladri S. Chatterji, Aldo Pacchiano, Peter L. Bartlett, Michael I. Jordan

TL;DR

This work studies reinforcement learning when feedback is restricted to a binary end-of-episode signal generated by a logistic model over trajectory features. It develops optimism-based algorithms that jointly learn the logistic reward parameters and plan using an estimated transition model. A sublinear regret bound is established for a trajectory-labeling variant, and a computationally efficient version, which relies on an explorability assumption, uses a sum-decomposable bonus that can be optimized with dynamic programming. The results advance the theory of RL under extreme feedback constraints and point to future work on more general generalized linear models (GLMs) and on non-binary or ranked trajectory feedback.

Abstract

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.
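To make the feedback model concrete, the following is a minimal sketch of how a single binary label per episode could be generated by a logistic model over trajectory features. It is an illustration under stated assumptions, not the paper's algorithm: the per-step feature table `phi`, the parameter `w_star`, the choice to sum per-step features into a trajectory feature, and all problem sizes are hypothetical.

```python
import numpy as np

def trajectory_features(states, actions, phi):
    """Sum per-step features phi[s, a] over one trajectory (assumed decomposition)."""
    return sum(phi[s, a] for s, a in zip(states, actions))

def label_probability(w, feat):
    """Logistic model: P(label = 1 | trajectory) = sigmoid(<w, feat>)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, feat)))

def sample_label(w, feat, rng):
    """Draw the single binary label revealed at the end of the episode."""
    return int(rng.random() < label_probability(w, feat))

rng = np.random.default_rng(0)
n_states, n_actions, dim, horizon = 4, 2, 3, 5   # hypothetical sizes

# Hypothetical feature table and unknown reward parameter.
phi = rng.normal(size=(n_states, n_actions, dim)) / np.sqrt(horizon)
w_star = rng.normal(size=dim)

# One episode under a uniformly random policy (transitions are ignored here).
states = rng.integers(n_states, size=horizon)
actions = rng.integers(n_actions, size=horizon)

feat = trajectory_features(states, actions, phi)
p_good = label_probability(w_star, feat)
label = sample_label(w_star, feat, rng)   # the only feedback for this episode
```

The learner in this sketch would observe only `label`, never `p_good` or any per-step reward, which is what makes estimating `w_star` and planning at the same time difficult.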

Paper Structure

This paper contains 42 sections, 25 theorems, 211 equations, 1 figure, 3 algorithms.

Key Result

Lemma 3.0

For any $\delta \in (0,1]$, define the event $\mathcal{E}_\delta$ [...]. Then $\mathbb{P}(\mathcal{E}_\delta) \geq 1-\delta$.

Figures (1)

  • Figure 1: Left: Reward learning curve averaged over $40$ independent runs; the shaded region shows $\pm$ one standard deviation. Middle: The purple and yellow paths are two sample paths taken by an initial random policy. Right: The purple and yellow paths are two sample paths taken by a trained policy.

Theorems & Definitions (29)

  • Lemma 3.0
  • Theorem 3.1
  • Lemma 3.3
  • Theorem 3.4
  • Proposition 3.4
  • Lemma A.1
  • Theorem A.2: Matrix Freedman inequality
  • Lemma A.3: Determinant Lemma
  • Lemma B.0
  • Lemma B.0
  • ...and 19 more