Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes

Kevin Vora, Yu Zhang

TL;DR

The paper addresses reward adaptation in reinforcement learning for discrete MDPs by proposing Q-Manipulation (Q-M), a bounds-based approach that uses a lite-model of environment dynamics and a known combination function f over source rewards to compute upper and lower bounds on the target Q-function $Q_{\mathcal{R}}^*$. Through iterative Bellman-like updates, Q-M prunes suboptimal actions before learning, yielding improved sample efficiency while preserving optimality under proper initialization. An extension, Monotonic Q-Manipulation (M-Q-M), refines bounds to tighten pruning and accelerate convergence; the authors provide contraction-based convergence proofs and optimality guarantees, including cases with linear and noisy reward combinations. Empirically, Q-M and M-Q-M outperform SFQL, SQB, and vanilla Q-Learning across gridworlds and autogenerated MDPs, with pronounced gains when source-target reward functions are reasonably aligned, while nonlinear or highly noisy mappings reduce benefits. The work discusses limitations (e.g., scalability to continuous spaces, safety concerns) and outlines directions for extending the methodology to broader domain shifts and real-world applications.

Abstract

In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
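The pruning step described in the abstract can be illustrated with a minimal sketch. It assumes one natural rule: an action is dropped in a state when its upper bound on the target Q-value falls strictly below the best lower bound among the actions in that state. The function name `prune_actions`, the dictionary layout, and the numbers are hypothetical, and the paper's exact pruning criterion may differ.

```python
from typing import Dict, List, Tuple

# (state, action) -> bound on the target Q-value (assumed layout)
Bounds = Dict[Tuple[int, int], float]


def prune_actions(lower: Bounds, upper: Bounds,
                  states: List[int], actions: List[int]) -> Dict[int, List[int]]:
    """Return, per state, the actions that survive pruning.

    An action is kept when its upper bound is at least the largest lower
    bound in that state; ties are kept, which is the conservative choice.
    """
    kept: Dict[int, List[int]] = {}
    for s in states:
        best_lower = max(lower[(s, a)] for a in actions)
        kept[s] = [a for a in actions if upper[(s, a)] >= best_lower]
    return kept


# Hypothetical bounds for two states and three actions (illustrative only).
states, actions = [0, 1], [0, 1, 2]
lower = {(0, 0): 1.0, (0, 1): 0.2, (0, 2): 0.5,
         (1, 0): 0.0, (1, 1): 0.0, (1, 2): 0.0}
upper = {(0, 0): 2.0, (0, 1): 0.8, (0, 2): 1.5,
         (1, 0): 1.0, (1, 1): 1.0, (1, 2): 1.0}

kept = prune_actions(lower, upper, states, actions)
print(kept)
```

In state 0, action 1 is pruned because its upper bound (0.8) is below the lower bound of action 0 (1.0). In state 1 the bounds overlap, so nothing can be pruned. Provided the bounds are valid, the optimal action in each state always survives this rule, which is the intuition behind pruning without loss of optimality.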

Paper Structure

This paper contains 28 sections, 11 theorems, 71 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

$Q^{\mu}_{R}(s,a) = -Q^{*}_{-R}(s,a)$, where $Q^{*}_{-R}(s,a)$ denotes the Q-function of the optimal policy under the negated reward function $-R$.
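The identity can be checked numerically on a small tabular MDP. The sketch below assumes that $Q^{\mu}_{R}$ is the Q-function of the return-minimizing policy (the listing above does not define $\mu$), and that rewards take the form $R(s,a)$. The helper `q_iteration`, the random MDP, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List

Matrix = List[List[float]]


def q_iteration(T: List[Matrix], R: Matrix, gamma: float,
                pick: Callable[[List[float]], float], iters: int = 500) -> Matrix:
    """Tabular Q-iteration; `pick` is max (optimal) or min (worst-case) backup."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [pick(row) for row in Q]  # state values under the chosen backup
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


# Small random MDP with a fixed seed (assumed sizes and discount factor).
rng = random.Random(0)
n_states, n_actions, gamma = 4, 3, 0.9
T = []
for s in range(n_states):
    T.append([])
    for a in range(n_actions):
        w = [rng.random() + 1e-3 for _ in range(n_states)]
        total = sum(w)
        T[s].append([x / total for x in w])  # normalized transition row
R = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)] for _ in range(n_states)]
neg_R = [[-r for r in row] for row in R]

Q_min = q_iteration(T, R, gamma, min)           # assumed meaning of Q^mu_R
Q_star_neg = q_iteration(T, neg_R, gamma, max)  # Q^*_{-R}

# Lemma 3.1 predicts Q_min == -Q_star_neg, so the sum should vanish.
gap = max(abs(Q_min[s][a] + Q_star_neg[s][a])
          for s in range(n_states) for a in range(n_actions))
print(gap)
```

Under this reading, the lemma means the worst-case Q-function needs no special machinery: it can be obtained by solving an ordinary optimal-control problem on the negated reward and flipping the sign.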

Figures (6)

  • Figure 1: Convergence plots for Dollar Euro (top), Racetrack (mid), and Frozen Lake (bottom).
  • Figure 2: Heat-maps illustrating action pruning in the Dollar Euro domain using M-Q-M (top) and Q-M (bottom). A lighter shade of blue indicates that fewer actions remain after pruning.
  • Figure 3: Convergence plots for auto-generated domains: $\mathcal{R}=R_1 +R_2$ (top) and $\mathcal{R}=(R_1+R_2)^{3}$ (bottom).
  • Figure 4: Performance under varying noise in auto-generated domains. Left: Actions pruned (%) vs. noise. Middle: M-Q-M convergence. Right: Q-M convergence. (color: dark = M-Q-M, lighter = Q-M).
  • Figure 5: Convergence plots for an autogenerated domain with a linear reward combination: Behavior 1 (left), Behavior 2 (center), and Target (right). M-Q-M_p denotes M-Q-M in practice, which performs action pruning using an estimated lite-model, reward model, and Q-variants.
  • ...and 1 more figure

Theorems & Definitions (23)

  • Lemma 3.1
  • Definition 3.2
  • Definition 3.3: Q-M Bellman Operators
  • Theorem 3.4: Q-M Convergence
  • Theorem 3.5
  • Definition 3.6
  • Theorem 3.7: M-Q-M Convergence
  • Theorem 3.8
  • Corollary 3.9: Non-uniqueness
  • Lemma 3.10
  • ...and 13 more