Transition Transfer $Q$-Learning for Composite Markov Decision Processes

Jinhang Chai, Elynn Chen, Lin Yang

TL;DR

This work introduces a composite MDP framework where transition dynamics decompose into a low-rank shared component $L^*$ plus a sparse task-specific component $S^*$, enabling principled transfer in high-dimensional RL. It develops single-task UCB-$Q$-Learning for high-dimensional composite MDPs and a transfer-enabled UCB-TQL algorithm that leverages a source task to reduce target regret, achieving dimension-independent guarantees that scale with rank and sparsity. The transfer analysis shows that, with enough source data, the target regret can attain a rate of $\tilde{O}(\sqrt{eH^5N})$, where $N$ is the number of target trajectories and $e$ quantifies the sparse difference between tasks, decoupling the rate from the ambient dimension $d$. The work provides rigorous estimation error bounds for matrix recovery and introduces refined confidence regions that exploit sparsity differences, bridging theory and the practical benefits of transfer in complex transition dynamics.

Abstract

To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework where high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{O}(\sqrt{eH^5N})$ that scales independently of the ambient dimension. Here, $N$ represents the number of trajectories in the target task, while $e$ quantifies the sparse differences between tasks. This result demonstrates substantial improvement over single-task RL by effectively leveraging the structural similarities between tasks. Our theoretical analysis provides rigorous guarantees for how UCB-TQL exploits shared dynamics while adapting to task-specific variations.
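To make the low-rank-plus-sparse structure concrete, the following is a minimal NumPy sketch, not the paper's estimator or algorithm. It builds two hypothetical $d \times d$ matrices that share one low-rank component and differ only in their sparse components, then counts the nonzero entries of the difference. The function name `make_composite`, the sizes `d`, `rank`, `s`, and the Gaussian entries are all assumptions chosen for illustration. The matrices are structural stand-ins and are not normalized into valid transition probabilities.

```python
import numpy as np


def make_composite(d, rank, n_sparse, rng, shared_low_rank=None):
    """Return (M, L, S) with M = L + S, L of rank `rank`, S with `n_sparse` nonzeros.

    Hypothetical helper for illustration; entries are Gaussian and the result
    is NOT a normalized transition kernel.
    """
    if shared_low_rank is None:
        U = rng.standard_normal((d, rank))
        V = rng.standard_normal((d, rank))
        L = U @ V.T / np.sqrt(d)  # product of two thin factors has rank <= `rank`
    else:
        L = shared_low_rank  # reuse the shared component across tasks
    S = np.zeros((d, d))
    idx = rng.choice(d * d, size=n_sparse, replace=False)  # distinct flat positions
    S.flat[idx] = rng.standard_normal(n_sparse)
    return L + S, L, S


rng = np.random.default_rng(0)
d, rank, s = 50, 3, 20  # assumed sizes, chosen only for the demo

# Source and target share L but draw their own sparse parts.
M_src, L, S_src = make_composite(d, rank, s, rng)
M_tgt, _, S_tgt = make_composite(d, rank, s, rng, shared_low_rank=L)

# The task difference is purely sparse: the shared low-rank part cancels.
sparse_diff = S_tgt - S_src
e = int(np.count_nonzero(sparse_diff))

print("rank of shared component:", np.linalg.matrix_rank(L))
print("nonzeros in task difference:", e, "out of", d * d, "entries")
```

In this toy setting the difference between tasks has at most `2 * s` nonzero entries regardless of `d`, which is the kind of structure that lets a transfer bound depend on $e$ rather than on the ambient dimension.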

Paper Structure

This paper contains 23 sections, 7 theorems, 76 equations, 2 algorithms.

Key Result

Lemma 1

For composite MDPs in Definition ass:matrix, under Assumptions ass:low-rank-sparse and asp:regularity, the estimator obtained by solving program eq:rl-rank-sparse at the end of the $n$-th episode satisfies, with probability at least $1-1/(n^2H)$, that, where $\beta_n$ is defined in def:beta_n.

Theorems & Definitions (21)

  • Definition 1: Composite MDPs
  • Remark 1
  • Remark 2
  • Lemma 1: Transition Estimation Error
  • Remark 3
  • Theorem 1: Single-Task Regret Upper Bound
  • Remark 4
  • Remark 5
  • Lemma 2: Estimation Error
  • Remark 6
  • ...and 11 more