Robust Knowledge Transfer in Tiered Reinforcement Learning
Jiawei Huang, Niao He
TL;DR
This work addresses robust parallel transfer in Tiered Reinforcement Learning, where a source low-tier task $M_{Lo}$ and a target high-tier task $M_{Hi}$ are learned concurrently and their similarity is unknown. It introduces the Optimal Value Dominance (OVD) condition and the notion of transferable states to characterize when transferring knowledge helps, and then develops robust algorithms for both the single-source and multiple-source settings. For single-source MAB and RL, the proposed methods balance pessimistic transfer from $M_{Lo}$ with online exploration in $M_{Hi}$, achieving constant regret on transferable states and near-optimal regret elsewhere; when $M_{Hi}=M_{Lo}$, the bound improves over prior results. The framework extends to multiple source tasks through a Trust-till-Failure mechanism, which ensembles the source tasks so that the benefits cover a larger state-action space, at the cost of a modest log factor in the regret. Overall, the work provides theoretical guarantees for robust parallel transfer between partially similar tasks, together with a mechanism for selecting and aggregating source tasks.
Abstract
In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework in which the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task, reducing the exploration risk of the latter while the two tasks are solved in parallel. Unlike previous work, we do not assume that the low-tier and high-tier tasks share the same dynamics or reward functions, and we focus on robust knowledge transfer without prior knowledge of the task similarity. We identify a natural and necessary condition for our objective, called "Optimal Value Dominance". Under this condition, we propose novel online learning algorithms that, for the high-tier task, achieve constant regret on a subset of states determined by the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task they remain near-optimal without any sacrifice. We further study the setting with multiple low-tier tasks and propose a novel transfer source selection mechanism that ensembles the information from all low-tier tasks and yields provable benefits on a much larger state-action space.
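To make the idea of balancing pessimistic transfer with online exploration concrete, the following is a minimal sketch for a two-task multi-armed bandit. It is an illustration written for this summary, not the algorithm from the paper: the function names (`confidence_radius`, `run_tiered_bandit`), the Hoeffding-style bonus, the Bernoulli rewards, and the specific trust test (follow the low-tier recommendation only while the high-tier upper confidence bound for that arm is at least the low-tier lower confidence bound) are all assumptions chosen for simplicity.

```python
import math
import random


def confidence_radius(n_pulls, t):
    """Hoeffding-style bonus (an assumed form); infinite for unpulled arms."""
    if n_pulls == 0:
        return float("inf")
    return math.sqrt(2.0 * math.log(max(t, 2)) / n_pulls)


def run_tiered_bandit(mu_lo, mu_hi, horizon, seed=0):
    """Run a low-tier and a high-tier Bernoulli bandit in parallel.

    Returns the cumulative pseudo-regret of the high-tier task per round.
    Hypothetical sketch: the trust rule below is an illustrative stand-in
    for a robust transfer test, not the paper's procedure.
    """
    rng = random.Random(seed)
    k = len(mu_lo)
    counts = {"lo": [0] * k, "hi": [0] * k}
    sums = {"lo": [0.0] * k, "hi": [0.0] * k}
    best_hi = max(mu_hi)
    cumulative, regret_hi = 0.0, []

    def mean(task, a):
        return sums[task][a] / counts[task][a] if counts[task][a] else 0.0

    def ucb(task, a, t):
        return mean(task, a) + confidence_radius(counts[task][a], t)

    def lcb(task, a, t):
        return mean(task, a) - confidence_radius(counts[task][a], t)

    for t in range(1, horizon + 1):
        # Low-tier task: plain optimistic (UCB) exploration, no sacrifice.
        a_lo = max(range(k), key=lambda a: ucb("lo", a, t))

        # Pessimistic recommendation: the arm with the best lower bound in Lo.
        candidate = max(range(k), key=lambda a: lcb("lo", a, t))
        lo_value = lcb("lo", candidate, t)

        # Trust the recommendation only while Hi's own data does not refute it.
        trusted = lo_value > float("-inf") and ucb("hi", candidate, t) >= lo_value
        if trusted:
            a_hi = candidate
        else:
            # Fall back to ordinary UCB exploration in the high-tier task.
            a_hi = max(range(k), key=lambda a: ucb("hi", a, t))

        # Draw Bernoulli rewards and update the statistics of both tasks.
        for task, arm, mu in (("lo", a_lo, mu_lo), ("hi", a_hi, mu_hi)):
            reward = 1.0 if rng.random() < mu[arm] else 0.0
            counts[task][arm] += 1
            sums[task][arm] += reward

        cumulative += best_hi - mu_hi[a_hi]
        regret_hi.append(cumulative)

    return regret_hi
```

Calling `run_tiered_bandit([0.9, 0.5, 0.2], [0.9, 0.5, 0.2], horizon=500)` simulates two identical tasks, while changing `mu_hi` so that the low-tier optimal arm is poor in the high-tier task exercises the fallback branch. The sketch makes no claim about matching the regret bounds stated above; it only shows the control flow of transferring pessimistically and reverting to exploration when the transferred estimate is contradicted.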
