On the Theory of Reinforcement Learning with Once-per-Episode Feedback
Niladri S. Chatterji, Aldo Pacchiano, Peter L. Bartlett, Michael I. Jordan
TL;DR
This work studies reinforcement learning when feedback is restricted to a single binary signal at the end of each episode, generated by a logistic model over trajectory features. It develops optimism-based algorithms that jointly learn the logistic reward parameters and plan using an estimated transition model, achieving sublinear regret. A sublinear regret bound is established for a trajectory-labeling variant, and a computationally efficient version, valid under an explorability assumption, uses a sum-decomposable bonus with dynamic programming. The results advance the theory of RL under extreme feedback constraints and point to future work on generalized linear models (GLMs) beyond the logistic case and on non-binary or ranked trajectory feedback.
Abstract
We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of RL, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was "good" or "bad" than to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.
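
To make the feedback model concrete, the following is a minimal sketch of how a binary once-per-episode label could be drawn from a logistic model over trajectory features. The feature map (normalized state-action visit counts), the parameter vector `w`, and all function names are illustrative assumptions, not taken from the paper; only the logistic form of the label distribution comes from the description above.

```python
import math
import random


def trajectory_features(trajectory, n_states, n_actions):
    """Hypothetical feature map: normalized state-action visit counts.

    `trajectory` is a list of (state, action) index pairs. This particular
    choice of features is an assumption made for illustration only.
    """
    phi = [0.0] * (n_states * n_actions)
    for s, a in trajectory:
        phi[s * n_actions + a] += 1.0 / len(trajectory)
    return phi


def label_probability(phi, w):
    """Probability of a "good" label (y = 1): sigmoid of <w, phi>."""
    z = sum(wi * xi for wi, xi in zip(w, phi))
    return 1.0 / (1.0 + math.exp(-z))


def sample_label(phi, w, rng):
    """Draw the single binary end-of-episode feedback signal."""
    return 1 if rng.random() < label_probability(phi, w) else 0


# Example: a two-step episode in a 2-state, 2-action problem.
rng = random.Random(0)
episode = [(0, 1), (1, 0)]
phi = trajectory_features(episode, n_states=2, n_actions=2)
w = [0.0, 2.0, 2.0, 0.0]  # hypothetical "true" parameter, unknown to the learner
y = sample_label(phi, w, rng)  # the only feedback received for the whole episode
```

In this sketch the learner would observe only `y` for each episode, never per-step rewards, and would have to estimate `w` from pairs of trajectory features and labels.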
