Demonstration-Regularized RL

Daniil Tiapkin; Denis Belomestny; Daniele Calandriello; Eric Moulines; Alexey Naumov; Pierre Perrault; Michal Valko; Pierre Menard

TL;DR

The paper studies, in theory, how much expert demonstrations help reinforcement learning. It introduces demonstration-regularized RL, which learns a behavior cloning (BC) policy from the demonstrations and then uses it as the reference policy in KL-regularized best policy identification. With $N^{\mathrm{E}}$ demonstrations, an optimal policy can be identified to precision $\varepsilon$ with sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite MDPs and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear MDPs. As a by-product, the BC policy itself comes with guarantees on the trajectory KL divergence to the expert. For the regularized problem, the authors propose the fast-rate algorithms UCBVI-Ent+ and LSVI-UCB-Ent, which exploit the KL regularization with parameter $\lambda$ to reach sample complexities of $\widetilde{O}(H^5 S^2 A /(\lambda \varepsilon))$ and $\widetilde{O}(H^5 d^2 /(\lambda \varepsilon))$ in the finite and linear settings, respectively. The framework extends to RLHF with offline preference data, where sample complexity of the same order is obtained: RLHF with demonstrations can be as efficient as RL with demonstrations, because uncertainty in the estimated reward is handled by the KL regularization rather than by pessimism injection. Overall, the results give theoretical support for demonstration-regularized methods as a way to improve RL sample efficiency in both standard and human-feedback settings.
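To make the behavior cloning step concrete, here is a minimal sketch for the finite (tabular) case. It assumes each demonstration is a list of `(state, action)` index pairs, one pair per step. It uses add-`alpha` smoothing, which is the minimizer of the negative log-likelihood plus a log-penalty `alpha * sum(-log pi(a|s))`. The function name, the data layout, and the choice of smoothing are illustrative assumptions, not necessarily the paper's estimator.

```python
import numpy as np

def behavior_cloning_tabular(demos, H, S, A, alpha=1.0):
    """Count-based behavior cloning with add-alpha smoothing (illustrative).

    demos: iterable of trajectories, each a list of (state, action) pairs,
           where the pair at position h was observed at step h.
    Returns an array pi of shape (H, S, A) with pi[h, s] a distribution
    over actions.
    """
    # Start every (h, s, a) count at alpha so all probabilities stay positive.
    counts = np.full((H, S, A), float(alpha))
    for traj in demos:
        for h, (s, a) in enumerate(traj):
            counts[h, s, a] += 1.0
    # Normalize over actions to obtain a policy for every step and state.
    return counts / counts.sum(axis=-1, keepdims=True)
```

Smoothing keeps every action probability strictly positive, so the KL divergence from any other policy to the cloned reference stays finite.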

Abstract

Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations through KL-regularization towards a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from prior work.
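The role of the KL regularizer can be illustrated with a planning sketch for a finite-horizon tabular MDP whose transitions and rewards are assumed known (the setting studied in the paper instead requires learning them from samples). Under that assumption, backward induction on the objective "return minus `lam` times the trajectory KL divergence to a reference policy" has a closed-form step: the policy is the reference reweighted by `exp(Q / lam)`, and the value is a log-sum-exp. The function name, the array layout, and the symbol `lam` are hypothetical.

```python
import numpy as np

def kl_regularized_planning(P, r, pi_ref, lam):
    """Backward induction for a KL-regularized finite-horizon MDP (sketch).

    P:      (H, S, A, S) transition probabilities, assumed known here.
    r:      (H, S, A) rewards, assumed known here.
    pi_ref: (H, S, A) reference policy with strictly positive entries.
    lam:    regularization parameter, lam > 0.
    Returns the regularized-optimal policy (H, S, A) and values (H + 1, S).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S, A))
    for h in reversed(range(H)):
        # Q-values under the regularized value of the next step, shape (S, A).
        Q = r[h] + P[h] @ V[h + 1]
        # Unnormalized log-policy: log pi_ref + Q / lam.
        logits = np.log(pi_ref[h]) + Q / lam
        # Numerically stable log-sum-exp over actions.
        m = logits.max(axis=1, keepdims=True)
        log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
        V[h] = lam * log_z[:, 0]
        pi[h] = np.exp(logits - log_z)
    return pi, V
```

A large `lam` keeps the result close to the reference policy, while a small `lam` approaches the greedy policy for the unregularized return.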

Paper Structure (72 sections, 59 theorems, 381 equations, 1 table, 4 algorithms)


Introduction
Setting
MDPs
Policy & value functions
Trajectory Kullback-Leibler divergence
Behavior cloning
Imitation learning
Behavior cloning
Finite MDPs
Linear MDPs
Demonstration-regularized RL
Regularized best policy identification (BPI)
BPI with demonstration
Demonstration-regularized RL
UCBVI-Ent+ sampling rule
...and 57 more sections

Key Result

Theorem 1

Let the assumptions on the regularity of the hypothesis class and on the regularity of the behavior policy be satisfied, and let $0 \leq \mathcal{R}_h(\pi_h) \leq M$ for all $h\in[H]$ and any policy $\pi_h \in \mathcal{F}_h$. Then, with probability at least $1-\delta$, the behavior cloning policy $\pi^{\mathrm{BC}}$ satisfies

Theorems & Definitions (121)

Definition 1
Definition 2
Theorem 1
Corollary 1
Remark 1
Remark 2
Corollary 2
Definition 3
Definition 4
Theorem 2
...and 111 more
