Pessimistic Value Iteration for Multi-Task Data Sharing in Offline Reinforcement Learning

Chenjia Bai; Lingxiao Wang; Jianye Hao; Zhuoran Yang; Bin Zhao; Zhen Wang; Xuelong Li

TL;DR

This work tackles offline reinforcement learning when a target task has limited data by introducing UTDS, an uncertainty-based multi-task data sharing framework. UTDS shares all available data across related tasks and uses an ensemble of $Q$-networks to quantify uncertainty. It then applies pessimistic value updates that penalize high-uncertainty state-action pairs, with an explicit mechanism for out-of-distribution actions. The authors establish a theoretical connection between the UTDS uncertainty penalty and a $\xi$-uncertainty quantifier in linear MDPs, showing that the optimality gap is governed by the expected data coverage of the shared dataset. Empirically, on a multi-task benchmark built from the DeepMind Control Suite, UTDS improves consistently over Direct Sharing and CDS-based baselines. Code and datasets are released for reproducibility and further study.

Abstract

Offline Reinforcement Learning (RL) has shown promising results in learning a task-specific policy from a fixed dataset. However, successful offline RL often relies heavily on the coverage and quality of the given dataset. In scenarios where the dataset for a specific task is limited, a natural approach is to improve offline RL with datasets from other tasks, namely, to conduct Multi-Task Data Sharing (MTDS). Nevertheless, directly sharing datasets from other tasks exacerbates the distribution shift in offline RL. In this paper, we propose an uncertainty-based MTDS approach that shares the entire dataset without data selection. Given ensemble-based uncertainty quantification, we perform pessimistic value iteration on the shared offline dataset, which provides a unified framework for single- and multi-task offline RL. We further provide theoretical analysis, which shows that the optimality gap of our method is only related to the expected data coverage of the shared dataset, thus resolving the distribution shift issue in data sharing. Empirically, we release an MTDS benchmark and collect datasets from three challenging domains. The experimental results show our algorithm outperforms the previous state-of-the-art methods in challenging MTDS problems. See https://github.com/Baichenjia/UTDS for the datasets and code.
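The pessimistic update described in the abstract can be illustrated with a minimal sketch. It assumes the uncertainty is taken as the standard deviation across ensemble members and subtracted from the bootstrapped value with a weight `beta`; the function name `pessimistic_target` and the parameters `gamma` and `beta` are hypothetical and are not taken from the released UTDS code.

```python
import numpy as np

def pessimistic_target(reward, next_q_ensemble, gamma=0.99, beta=1.0):
    """Uncertainty-penalized TD target for a batch (illustrative only).

    reward:          array of shape (batch,)
    next_q_ensemble: array of shape (n_ensemble, batch), each member's Q(s', a')
    beta:            assumed penalty weight on the ensemble disagreement
    """
    reward = np.asarray(reward, dtype=float)
    next_q_ensemble = np.asarray(next_q_ensemble, dtype=float)
    # Ensemble disagreement stands in for the uncertainty quantifier.
    uncertainty = next_q_ensemble.std(axis=0)
    mean_q = next_q_ensemble.mean(axis=0)
    # Pessimism: subtract the uncertainty before bootstrapping.
    return reward + gamma * (mean_q - beta * uncertainty)

# Two transitions, two ensemble members; the members disagree only on the second.
targets = pessimistic_target(
    reward=[1.0, 1.0],
    next_q_ensemble=[[1.0, 2.0], [1.0, 4.0]],
    gamma=0.5,
    beta=1.0,
)
```

In this toy batch the second transition has the higher bootstrapped mean but also the higher disagreement, so its target is lowered by the penalty while the first is left unchanged.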

Paper Structure (42 sections, 8 theorems, 57 equations, 26 figures, 3 tables, 1 algorithm)

This paper contains 42 sections, 8 theorems, 57 equations, 26 figures, 3 tables, 1 algorithm.

Introduction
Preliminaries
Offline Reinforcement Learning
Multi-Task Data Sharing (MTDS)
Conservative Data Sharing (CDS)
Method
Uncertainty Quantifier
Pessimistic Value Iteration
Theoretical Analysis
UTDS in Linear MDPs
Optimality Gap
Related Work
Experiments
Tasks and Datasets
Baselines
...and 27 more sections

Key Result

Theorem 1

For a given state-action pair $(s,a)$, we denote the uncertainty for the single-task dataset ${\mathcal{D}}_i$ and shared dataset ${\widehat{\mathcal{D}}}_i={\mathcal{D}}_i\cup {\mathcal{D}}_{j\rightarrow i}$ as $\Gamma_i(s,a;{\mathcal{D}}_i)$ and $\Gamma_i(s,a;{\widehat{\mathcal{D}}}_i)$, respectively. Then $\Gamma_i(s,a;{\widehat{\mathcal{D}}}_i)\leq\Gamma_i(s,a;{\mathcal{D}}_i)$, which signifies that the shared data reduce the ensemble uncertainty.
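A small numerical check can make the claim concrete in the linear setting. The sketch below assumes the uncertainty takes the standard LCB form $\beta\sqrt{\phi^\top \Lambda^{-1}\phi}$ with $\Lambda=\sum_k \phi_k\phi_k^\top+\lambda I$, as used for pessimistic value iteration in linear MDPs; the random features, the function name `lcb_uncertainty`, and the values of `beta` and `lam` are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def lcb_uncertainty(phi, features, beta=1.0, lam=1.0):
    """LCB-style penalty beta * sqrt(phi^T Lambda^{-1} phi) (illustrative).

    phi:      feature vector of the query state-action pair, shape (d,)
    features: dataset features, shape (n, d)
    lam:      ridge regularizer that keeps Lambda invertible
    """
    features = np.atleast_2d(features)
    d = features.shape[1]
    # Regularized feature covariance of the dataset.
    cov = features.T @ features + lam * np.eye(d)
    return beta * float(np.sqrt(phi @ np.linalg.solve(cov, phi)))

rng = np.random.default_rng(0)
single = rng.normal(size=(20, 4))      # stand-in for the single-task dataset
extra = rng.normal(size=(60, 4))       # stand-in for data shared from another task
shared = np.vstack([single, extra])    # union of the two
phi = rng.normal(size=4)               # query feature vector

gamma_single = lcb_uncertainty(phi, single)
gamma_shared = lcb_uncertainty(phi, shared)
```

Adding rows to the dataset can only add positive semidefinite terms to $\Lambda$, so `gamma_shared` cannot exceed `gamma_single` for any query `phi` under these assumptions.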

Figures (26)

Figure 1: The illustration of CDS and UTDS for MTDS in training task $A_1$. (a) CDS includes a data selection process through the learned conservative value function. The selected data is added to the mixed dataset. (b) UTDS can share all data from other tasks without data selection. In policy training, UTDS performs pessimistic updates based on uncertainty in the large shared dataset.
Figure 2: An illustration of the uncertainty quantification of UTDS in (a) single-task dataset and (b) multi-task shared dataset. The multi-task datasets (i.e., the white, brown, and orange points) are generated by different Gaussian distributions. The uncertainty measured by the ensemble networks is represented by the color scales in the figures. The darker color means smaller uncertainty. As shown in the figures, sharing more data decreases the uncertainty.
Figure 3: The comparison between UTDS and Direct Sharing in Jaco Arm. The main task is Reach-Bottom-Left, and the shared data are replay datasets from the other three tasks (denoted as '+'). We show results of 5 dataset types of the main task. The shadow bars show the single-task scores.
Figure 4: The comparison between UTDS and CDS in Walker. The main tasks are Walker-Walk (Top) and Walker-Flip (Bottom), respectively. UTDS generally improves the performance compared to the single-task scores (i.e., the shadow bar), especially for non-expert datasets.
Figure 5: The illustration of ensemble uncertainty of UTDS calculated on training batches. The main task is Walker Flip with expert (left) and replay (right) datasets.
...and 21 more figures

Theorems & Definitions (15)

Definition 1: $\xi$-Uncertainty Quantifier (pevi-2021)
Claim 1
Theorem 1
Theorem 2
Corollary 1
Lemma 1: Equivalence between LCB-penalty and Ensemble Uncertainty
proof
Theorem: Restatement of Theorem \ref{thm:uncertainty-d}
proof
Theorem
...and 5 more
