Offline Reinforcement Learning with Domain-Unlabeled Data

Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, Masashi Sugiyama

TL;DR

The paper addresses offline reinforcement learning when the data come from multiple domains that share state and action spaces but differ in dynamics, and formalizes this setting as Positive-Unlabeled Offline RL (PUORL). It proposes a two-stage, plug-and-play approach: first, positive-unlabeled learning trains a domain classifier that filters the domain-unlabeled data to augment the small target-domain dataset; second, a standard offline RL method is applied to the augmented data. Empirical results on a dynamics-shifted D4RL benchmark show strong policy performance with limited labeled data and high PU classifier accuracy across shifts, indicating robustness to domain differences. The work offers a practical way to exploit large amounts of domain-unlabeled data in offline RL and points to future extensions to reward shift and broader weakly supervised learning scenarios.

Abstract

Offline reinforcement learning (RL) is vital in areas where active data collection is expensive or infeasible, such as robotics or healthcare. In the real world, offline datasets often involve multiple domains that share the same state and action spaces but have distinct dynamics, and only a small fraction of samples are clearly labeled as belonging to the target domain we are interested in. For example, in robotics, precise system identification may only have been performed for part of the deployments. To address this challenge, we consider Positive-Unlabeled Offline RL (PUORL), a novel offline RL setting in which we have a small amount of labeled target-domain data and a large amount of domain-unlabeled data from multiple domains, including the target domain. For PUORL, we propose a plug-and-play approach that leverages positive-unlabeled (PU) learning to train a domain classifier. The classifier then extracts target-domain samples from the domain-unlabeled data, augmenting the scarce target-domain data. Empirical results on a modified version of the D4RL benchmark demonstrate the effectiveness of our method: even when only 1 to 3 percent of the dataset is domain-labeled, our approach accurately identifies target-domain samples and achieves high performance, even under substantial dynamics shift. Our plug-and-play algorithm seamlessly integrates PU learning with existing offline RL pipelines, enabling effective multi-domain data utilization in scenarios where comprehensive domain labeling is prohibitive.

Paper Structure

This paper contains 39 sections, 1 equation, 2 figures, 13 tables, and 4 algorithms.

Figures (2)

  • Figure 1: Diagram of Positive-Unlabeled Offline RL (PUORL). PUORL involves a positive (target) domain and negative domains whose dynamics differ from those of the positive domain. We have two types of data: positive data and domain-unlabeled data, the latter being a mixture of samples from the positive and negative domains. We train a policy to maximize the expected return in the positive domain.
  • Figure 2: Diagram of our method. We first train a classifier $f$ using PU learning to distinguish positive domain data from negative domain data. Then, we filter the positive domain data from domain-unlabeled data by applying classifier $f$ to the domain-unlabeled dataset. Finally, we train a policy using off-the-shelf offline RL methods with the augmented dataset.
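
The pipeline in Figure 2 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy 1-D dynamics with an additive offset, the linear classifier on (s, a, s') features, the non-negative PU risk with a sigmoid loss as the PU objective, the known class prior of 0.5, and all function names. The final offline RL step is not implemented; the augmented dataset would simply be passed to any off-the-shelf offline RL method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_transitions(n, offset, rng):
    """Toy 1-D dynamics s' = s + a + offset + noise; the offset encodes the domain."""
    s = rng.uniform(-1.0, 1.0, n)
    a = rng.uniform(-1.0, 1.0, n)
    s_next = s + a + offset + 0.05 * rng.standard_normal(n)
    return np.stack([s, a, s_next], axis=1)

def loss_and_grad(w, b, x, sign):
    """Mean sigmoid loss l(z) = sigmoid(-z) at z = sign * f(x), with its gradients."""
    f = x @ w + b
    loss = sigmoid(-sign * f)
    dloss_df = -sign * loss * (1.0 - loss)
    return loss.mean(), (dloss_df[:, None] * x).mean(axis=0), dloss_df.mean()

def train_pu_classifier(x_pos, x_unl, prior, lr=0.5, steps=3000):
    """Full-batch gradient descent on a non-negative PU risk (assumed objective)."""
    w, b = np.zeros(x_pos.shape[1]), 0.0
    for _ in range(steps):
        _, gw_pp, gb_pp = loss_and_grad(w, b, x_pos, +1.0)       # positives as positive
        l_pn, gw_pn, gb_pn = loss_and_grad(w, b, x_pos, -1.0)    # positives as negative
        l_un, gw_un, gb_un = loss_and_grad(w, b, x_unl, -1.0)    # unlabeled as negative
        neg_risk = l_un - prior * l_pn
        gw_neg, gb_neg = gw_un - prior * gw_pn, gb_un - prior * gb_pn
        if neg_risk >= 0.0:
            # Usual case: descend on the full PU risk.
            gw, gb = prior * gw_pp + gw_neg, prior * gb_pp + gb_neg
        else:
            # Negative-risk estimate dropped below zero: push it back up.
            gw, gb = -gw_neg, -gb_neg
        w, b = w - lr * gw, b - lr * gb
    return w, b

rng = np.random.default_rng(0)
x_pos = make_transitions(100, 0.0, rng)           # small labeled target-domain set
x_unl_target = make_transitions(1000, 0.0, rng)   # unlabeled, actually target domain
x_unl_other = make_transitions(1000, 3.0, rng)    # unlabeled, shifted dynamics
x_unl = np.concatenate([x_unl_target, x_unl_other])
is_target = np.arange(len(x_unl)) < len(x_unl_target)  # ground truth, used for evaluation only

# Step 1: train the domain classifier f from positive and unlabeled data.
w, b = train_pu_classifier(x_pos, x_unl, prior=0.5)

# Step 2: keep the unlabeled transitions that f assigns to the target domain.
keep = x_unl @ w + b > 0.0
augmented = np.concatenate([x_pos, x_unl[keep]])

# Step 3 (not shown): train a policy on `augmented` with an offline RL method.
precision = (keep & is_target).sum() / max(keep.sum(), 1)
recall = (keep & is_target).sum() / is_target.sum()
print(f"kept {keep.sum()} transitions, precision={precision:.3f}, recall={recall:.3f}")
```

In this sketch the classifier sees whole transitions (s, a, s') rather than states alone, since under a dynamics shift the domain is only identifiable from how the next state depends on the current state and action. The class prior is taken as known; in practice it would have to be estimated or tuned.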