Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning

Shuang Qiu; Lingxiao Wang; Chenjia Bai; Zhuoran Yang; Zhaoran Wang

TL;DR

This work aims to improve sample efficiency in reinforcement learning by integrating contrastive self-supervised representation learning into online RL under low-rank transition models. The authors propose Contrastive UCB for single-agent MDPs and extend it to zero-sum Markov games (MGs), combining temporal contrastive losses with UCB-based exploration bonuses and, in the MG setting, a CCE-based policy update. They prove that the algorithm recovers the true representation and attains $\tilde{O}(1/\varepsilon^2)$ sample complexity for learning an $\varepsilon$-optimal policy in MDPs and an $\varepsilon$-approximate Nash equilibrium in MGs. Experiments on the Atari 100K benchmark show practical gains, with SPR-UCB outperforming SPR and the other reported baselines.
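To make the contrastive step concrete, here is a minimal sketch of one binary-classification form of a contrastive loss over a low-rank score $f = \psi(s')^\top \phi(s,a)$. The function name, the array shapes, and the clipping constant `eps` are illustrative assumptions, and the exact loss and negative-sampling scheme used in the paper may differ.

```python
import numpy as np

def contrastive_loss(phi, psi, y, eps=1e-8):
    """Binary contrastive loss for scores f = <psi(s'), phi(s, a)>.

    phi: (n, d) features of (state, action) pairs (hypothetical encoder output).
    psi: (n, d) features of candidate next states (hypothetical encoder output).
    y:   (n,) labels, 1 for an observed next state, 0 for a negative sample.
    """
    # Assumption: scores are positive; clip to avoid division by zero.
    f = np.clip(np.sum(phi * psi, axis=1), eps, None)
    pos = y * np.log1p(1.0 / f)      # -log(f / (1 + f)) on positive pairs
    neg = (1.0 - y) * np.log1p(f)    # -log(1 / (1 + f)) on negative pairs
    return float(np.mean(pos + neg))

# Toy usage with random positive features (illustration only).
rng = np.random.default_rng(0)
phi = rng.uniform(0.1, 1.0, size=(4, 3))
psi = rng.uniform(0.1, 1.0, size=(4, 3))
y = np.array([1.0, 0.0, 1.0, 0.0])
loss = contrastive_loss(phi, psi, y)
```

Under this form, the classifier's probability that a pair is a true transition is $f/(1+f)$, so minimizing the loss pushes the score up on observed transitions and down on negatives.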

Abstract

In view of its power in extracting feature representation, contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL), leading to efficient policy learning in various applications. Despite its tremendous empirical successes, the understanding of contrastive learning for RL remains elusive. To narrow such a gap, we study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss. Moreover, under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs. We further theoretically prove that our algorithm recovers the true representations and simultaneously achieves sample efficiency in learning the optimal policy and Nash equilibrium in MDPs and MGs. We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. To the best of our knowledge, we provide the first provably efficient online RL algorithm that incorporates contrastive learning for representation learning. Our codes are available at https://github.com/Baichenjia/Contrastive-UCB.
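As a rough illustration of how a UCB-type bonus can be built on top of learned features, the sketch below computes a standard elliptical bonus $\min\{\gamma \sqrt{\phi^\top \Sigma^{-1} \phi},\ \text{cap}\}$ with $\Sigma = \lambda I + \sum_i \phi_i \phi_i^\top$. The function name, the argument names, and the cap are assumptions made for illustration; this is not taken from the authors' code, and the paper's exact bonus may differ.

```python
import numpy as np

def ucb_bonus(phi_query, phi_data, lam, gamma, cap):
    """Elliptical exploration bonus from learned features (illustrative).

    phi_query: (m, d) features of the (state, action) pairs to score.
    phi_data:  (n, d) features of previously visited pairs.
    lam:       ridge regularizer (plays the role of lambda_k).
    gamma:     bonus scale (plays the role of gamma_k).
    cap:       assumed upper clip on the bonus, e.g. on the order of H.
    """
    d = phi_query.shape[1]
    sigma = lam * np.eye(d) + phi_data.T @ phi_data    # regularized covariance
    sol = np.linalg.solve(sigma, phi_query.T)          # Sigma^{-1} phi, shape (d, m)
    quad = np.sum(phi_query * sol.T, axis=1)           # phi^T Sigma^{-1} phi
    return np.minimum(gamma * np.sqrt(quad), cap)

# The bonus shrinks along directions that have been visited often.
query = np.array([[3.0, 4.0]])
no_data = np.zeros((0, 2))
seen = np.tile(query, (10, 1))
b_fresh = ucb_bonus(query, no_data, lam=1.0, gamma=2.0, cap=100.0)
b_seen = ucb_bonus(query, seen, lam=1.0, gamma=2.0, cap=100.0)
```

Solving the linear system instead of forming $\Sigma^{-1}$ explicitly is a numerical choice made here, not something the source specifies.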

Paper Structure (28 sections, 19 theorems, 202 equations, 2 figures, 2 tables, 4 algorithms)

Introduction
Preliminaries
Contrastive Learning for Single-Agent MDP
  Algorithm
  Main Result for Single-Agent MDP Setting
Contrastive Learning for Markov Game
  Algorithm
  Main Result for Markov Game Setting
Theoretical Analysis
  Analysis for Single-Agent MDP
  Analysis for Markov Game
Proof of Concept Experiments
  Implementation of Bonus
  Environments and Baselines
  Result Comparison
...and 13 more sections

Key Result

Theorem 3.6

Letting $\lambda_k= c_0 d \log(H|\mathcal{F}|k/\delta)$ for a sufficiently large constant $c_0>0$ and $\gamma_k= 4H(12\sqrt{ |\mathcal{A}|d} + \sqrt{c_0} d)/C_{\mathcal{S}}^-\cdot \sqrt{\log (2Hk|\mathcal{F}|/\delta)}$, with probability at least $1-3\delta$, we have where $C = H^4d^4|\mathcal{A}|/(C_{\mathcal{S}}^-)^2 + H^4d^3|\mathcal{A}|^2/(C_{\mathcal{S}}^-)^2 + H^6d^2|\mathcal{A}|/(C_{\mat

Figures (2)

Figure 1: Mean human-normalized score on the Atari 100K benchmark. The results of the baseline algorithms are taken from Agarwal et al. (2021). SPR-UCB outperforms SPR and the other baseline algorithms.
Figure 2: Stratified bootstrap estimates (Agarwal et al., 2021) with $95\%$ confidence intervals (CIs), based on $26$ Atari 100K tasks. Higher mean, median, and interquartile mean (IQM) and lower optimality gap are better. See Agarwal et al. (2021) for details. The results for the baseline algorithms are taken from the report by Agarwal et al. (2021). The results for SPR-UCB are based on $10$ runs per game.
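
For readers unfamiliar with the aggregate metrics named in Figure 2, the following sketch computes the interquartile mean and the optimality gap on an array of human-normalized scores. It follows the usual definitions (IQM as the mean of the middle 50% of scores, optimality gap as the average shortfall below a threshold `gamma`, assumed here to be 1.0 for human-level performance). The function names are hypothetical, and this is not the evaluation code behind the figure.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of scores, pooled over runs and tasks."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * x.size)              # number of entries trimmed from each end
    return float(x[cut:x.size - cut].mean())

def optimality_gap(scores, gamma=1.0):
    """Average amount by which scores fall short of gamma (lower is better)."""
    x = np.asarray(scores, dtype=float).ravel()
    return float(gamma - np.minimum(x, gamma).mean())

# Toy usage on made-up scores (not results from the paper).
iqm = interquartile_mean([0, 1, 2, 3, 4, 5, 6, 7])
gap = optimality_gap([0.5, 2.0])
```

The IQM discards the lowest and highest quarter of the pooled scores, which makes it less sensitive to outlier games than the mean while using more data than the median.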

Theorems & Definitions (42)

Remark 2.2
Remark 3.1
Definition 3.3: Function Class
Remark 3.4
Remark 3.5
Theorem 3.6: Sample Complexity
Definition 4.1: $\iota$-CCE
Theorem 4.2: Sample Complexity
Lemma 5.1: Transition Recovery
Lemma 5.2: Transition Recovery
...and 32 more