Reinforcement Learning from Multi-level and Episodic Human Feedback

Muhammad Qasim Elahi, Somtochukwu Oguchienti, Maheed H. Ahmed, Mahsa Ghasemi

TL;DR

This work addresses reinforcement learning in settings where per-step rewards are unavailable and human feedback arrives only as end-of-episode, multi-level scores. It introduces a $K$-ary trajectory feedback model and an online optimistic algorithm (K-UCBVI) that learns a latent trajectory-based reward via a concatenated weight vector $\mathbf{w}^* \in \mathbb{R}^{K d}$ and uses concentration-based estimates $\widehat{\mathbf{w}}_n$ to form an optimistic reward $\overline{R}(\widehat{\mathbf{w}}_n,\tau)$, with Markovian policy approximations via REINFORCE. The main contributions are a sublinear regret bound $\mathcal{CR}(N) = \mathcal{O}(\sqrt{N} \log \frac{N \log N}{\delta})$ and extensive grid-world experiments demonstrating effective learning from end-of-episode, multi-level feedback. This approach advances RL in non-Markovian reward settings by leveraging richer human feedback beyond binary comparisons and provides practical pathways for policy optimization under such signals.
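To make the $K$-ary trajectory feedback model more concrete, here is a minimal Python sketch of how an end-of-episode score could be drawn from a concatenated weight vector in $\mathbb{R}^{Kd}$. It rests on assumptions that the summary above does not state: that the score follows a softmax (categorical) distribution over the $K$ levels, and that the weight vector is laid out as $K$ consecutive blocks of length $d$, one per level. The names `feedback_probabilities` and `sample_feedback` are hypothetical.

```python
import numpy as np

def feedback_probabilities(w, phi, K):
    """Probabilities of the K feedback levels for one trajectory.

    ASSUMPTION: softmax model with logits w_j^T phi(tau), where w_j is the
    j-th length-d block of the concatenated vector w in R^{K d}.
    """
    d = phi.shape[0]
    logits = w.reshape(K, d) @ phi      # one logit per feedback level
    logits = logits - logits.max()      # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_feedback(rng, w, phi, K):
    """Draw one end-of-episode score in {0, ..., K-1}."""
    return int(rng.choice(K, p=feedback_probabilities(w, phi, K)))

# Illustrative usage with made-up numbers (not taken from the paper).
rng = np.random.default_rng(0)
K, d = 4, 3
w_star = rng.normal(size=K * d)         # hypothetical latent weights
phi_tau = rng.normal(size=d)            # hypothetical trajectory features
score = sample_feedback(rng, w_star, phi_tau, K)
```

Under this assumed model, a single score per episode is the only supervision the learner receives; no per-step reward is ever observed.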

Abstract

Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms of human input to specify or refine the reward function. Reinforcement learning from human feedback is a prominent approach that utilizes human comparative feedback, expressed as a preference for one behavior over another, to tackle this problem. In contrast to comparative feedback, we explore multi-level human feedback, which is provided in the form of a score at the end of each episode. This type of feedback is coarse, but it offers more informative signals about the underlying reward function than binary feedback. Additionally, it can handle non-Markovian rewards, as it is based on the evaluation of an entire episode. We propose an algorithm to efficiently learn both the reward function and the optimal policy from this form of feedback. Moreover, we show that the proposed algorithm achieves sublinear regret and demonstrate its empirical effectiveness through extensive simulations.

Paper Structure

This paper contains 16 sections, 4 theorems, 82 equations, 2 figures, 2 algorithms.

Key Result

Lemma 1

For any episode $n \in [N]$, the following holds with probability at least $1 - \delta$: where $\Sigma_{D_n} = \frac{1}{nK^2} \sum_{i=1}^n \sum_{j=0}^{K-1} \sum_{l=0}^{K-1} (\phi_j(\tau^{(i)})-\phi_l(\tau^{(i)}))(\phi_j(\tau^{(i)})-\phi_l(\tau^{(i)}))^T$, $\eta = \frac{\exp(-4B)}{2}$, and $C = \log(K\exp(2B))$.
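The quantities named in Lemma 1 can be computed directly from their definitions. The following Python sketch evaluates $\Sigma_{D_n}$, $\eta$, and $C$ given the per-level feature vectors $\phi_j(\tau^{(i)})$. The function names are hypothetical. The helper `block_features` additionally assumes that $\phi_j(\tau)$ places a $d$-dimensional trajectory feature in the $j$-th block of a $Kd$-dimensional vector; this layout is an illustrative guess and is not stated in the lemma.

```python
import numpy as np

def empirical_covariance(feats):
    """Sigma_{D_n} for feats of shape (n, K, D), where feats[i, j] = phi_j(tau^(i)).

    Implements (1 / (n K^2)) * sum_i sum_j sum_l (phi_j - phi_l)(phi_j - phi_l)^T.
    """
    n, K, _ = feats.shape
    # diffs[i, j, l] = phi_j(tau^(i)) - phi_l(tau^(i))
    diffs = feats[:, :, None, :] - feats[:, None, :, :]
    # Sum of outer products over all (i, j, l) triples.
    sigma = np.einsum("ijlp,ijlq->pq", diffs, diffs)
    return sigma / (n * K ** 2)

def lemma_constants(K, B):
    """eta = exp(-4B) / 2 and C = log(K exp(2B)), as defined in Lemma 1."""
    eta = np.exp(-4.0 * B) / 2.0
    C = np.log(K * np.exp(2.0 * B))
    return eta, C

def block_features(phi, K):
    """ASSUMED layout: phi_j(tau) holds phi in block j of a K*d vector, zeros elsewhere."""
    d = phi.shape[0]
    out = np.zeros((K, K * d))
    for j in range(K):
        out[j, j * d:(j + 1) * d] = phi
    return out

# Illustrative usage with made-up features (not data from the paper).
rng = np.random.default_rng(0)
K, d, n = 4, 3, 5
feats = np.stack([block_features(rng.normal(size=d), K) for _ in range(n)])
sigma = empirical_covariance(feats)   # shape (K*d, K*d)
eta, C = lemma_constants(K, B=1.0)
```

By construction $\Sigma_{D_n}$ is symmetric and positive semidefinite, since it is an average of outer products, and $C$ simplifies to $\log K + 2B$.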

Figures (2)

  • Figure 1: (a) $8\times8$ grid-world environment with the danger state (red cell), wall state (gray cells) and goal state (green cell) depicted. (b) Plot of the average true reward against the number of episodes for $K=4$. (c) Plot of the average true reward against the number of episodes for $K=6$.
  • Figure 2: (a) Impact of varying noisy feedback on the learned policy $(K=4)$. (b) Impact of varying noisy feedback on the learned policy $(K=6)$. (c) Effect of the confidence bound parameter on the learned policy $(K=4)$.

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Definition 1: Softmax Policies
  • Remark 1
  • Lemma 3