Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
Kevin Vora, Yu Zhang
TL;DR
The paper addresses reward adaptation in reinforcement learning for discrete MDPs by proposing Q-Manipulation (Q-M), a bounds-based approach that uses a lite-model of the environment dynamics and a known combination function $f$ over source rewards to compute upper and lower bounds on the target Q-function $Q_{\mathcal{R}}^*$. Iterative Bellman-like updates tighten these bounds, allowing Q-M to prune suboptimal actions before learning begins, which improves sample efficiency while preserving optimality under proper initialization. An extension, Monotonic Q-Manipulation (M-Q-M), refines the bounds to tighten pruning and accelerate convergence; the authors provide contraction-based convergence proofs and optimality guarantees, including cases with linear and noisy reward combinations. Empirically, Q-M and M-Q-M outperform SFQL, SQB, and vanilla Q-Learning across gridworlds and autogenerated MDPs, with pronounced gains when the source and target reward functions are reasonably aligned, while nonlinear or highly noisy mappings reduce the benefits. The work discusses limitations (e.g., scalability to continuous spaces, safety concerns) and outlines directions for extending the methodology to broader domain shifts and real-world applications.
Abstract
In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "Q-Manipulation" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that, in discrete domains, Q-M does not affect the optimality of the returned policy, and we show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
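To make the idea of bound-based action pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes that valid lower and upper bounds on the target Q-function are already available as arrays of shape (states, actions), and it uses a standard pruning rule: an action can be discarded in a state if its upper bound falls below the best lower bound of any action in that state. The function name `prune_actions` and the numeric values are hypothetical, chosen only for illustration.

```python
import numpy as np

def prune_actions(q_lower, q_upper):
    """Return a boolean mask keep[s, a] that is False where action a
    is provably suboptimal in state s, given valid Q-function bounds.

    Assumed rule (illustrative): prune a in s if
    q_upper[s, a] < max over a' of q_lower[s, a'].
    """
    # Best value that some action is guaranteed to achieve in each state.
    best_lower = q_lower.max(axis=1, keepdims=True)
    # Keep an action only if it could still match or beat that guarantee.
    return q_upper >= best_lower

# Hypothetical bounds for a toy problem with 2 states and 3 actions.
q_lower = np.array([[1.0, 0.2, 0.5],
                    [0.0, 0.1, 0.3]])
q_upper = np.array([[2.0, 0.8, 1.5],
                    [0.2, 0.9, 0.6]])

keep = prune_actions(q_lower, q_upper)
print(keep)
```

Under this rule an optimal action is never removed as long as the bounds are valid, since its upper bound is at least its true value, which in turn is at least every other action's lower bound. Tighter bounds prune more actions, which is why an iterative tightening step matters before learning starts.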
