Offline Reinforcement Learning with Domain-Unlabeled Data
Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, Masashi Sugiyama
TL;DR
The paper introduces Positive-Unlabeled Offline RL (PUORL), an offline reinforcement learning setting in which data come from multiple domains that share state and action spaces but differ in dynamics, and only a small amount of the data is labeled as target-domain. It proposes a two-stage, plug-and-play approach: positive-unlabeled (PU) learning trains a domain classifier, the classifier filters the domain-unlabeled data to augment the small target-domain dataset, and a standard offline RL method is then applied to the augmented data. Empirical results on a dynamics-shifted version of the D4RL benchmark show strong policy performance and accurate PU classification when only 1 to 3 percent of the data is domain-labeled, including under substantial dynamics shift. The work provides a practical way to leverage large amounts of domain-unlabeled data for offline RL and points to future extensions to reward shift and broader weakly supervised learning scenarios.
Abstract
Offline reinforcement learning (RL) is vital in areas where active data collection is expensive or infeasible, such as robotics or healthcare. In the real world, offline datasets often involve multiple domains that share the same state and action spaces but have distinct dynamics, and only a small fraction of samples are clearly labeled as belonging to the target domain we are interested in. For example, in robotics, precise system identification may only have been performed for part of the deployments. To address this challenge, we consider Positive-Unlabeled Offline RL (PUORL), a novel offline RL setting in which we have a small amount of labeled target-domain data and a large amount of domain-unlabeled data from multiple domains, including the target domain. For PUORL, we propose a plug-and-play approach that leverages positive-unlabeled (PU) learning to train a domain classifier. The classifier then extracts target-domain samples from the domain-unlabeled data, augmenting the scarce target-domain data. Empirical results on a modified version of the D4RL benchmark demonstrate the effectiveness of our method: even when only 1 to 3 percent of the dataset is domain-labeled, our approach accurately identifies target-domain samples and achieves high performance, even under substantial dynamics shift. Our plug-and-play algorithm seamlessly integrates PU learning with existing offline RL pipelines, enabling effective multi-domain data utilization in scenarios where comprehensive domain labeling is prohibitive.
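The filtering stage described in the abstract can be illustrated with a minimal sketch on toy data. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D dynamics s' = s + a + offset + noise, the function names `make_transitions`, `fit_pu_classifier`, and `predict_target`, the sample sizes, and the class prior of 0.5, which is treated as known. The abstract does not say which PU estimator is used, so the sketch swaps in one simple choice, the squared-loss unbiased PU risk with a linear model, because its minimiser has a closed form.

```python
import numpy as np

def make_transitions(n, offset, rng):
    """Sample (s, a, s') from toy dynamics s' = s + a + offset + noise (hypothetical)."""
    s = rng.normal(size=n)
    a = rng.uniform(-1.0, 1.0, size=n)
    s_next = s + a + offset + 0.1 * rng.normal(size=n)
    return np.column_stack([s, a, s_next])

def fit_pu_classifier(x_pos, x_unl, prior, ridge=1e-3):
    """Closed-form minimiser of the squared-loss unbiased PU risk for a linear model.

    Solves (A + ridge * I) theta = 2 * prior * mean_P(phi) - mean_U(phi),
    where A is the second-moment matrix of the unlabeled features.
    """
    phi_p = np.column_stack([np.ones(len(x_pos)), x_pos])
    phi_u = np.column_stack([np.ones(len(x_unl)), x_unl])
    gram = phi_u.T @ phi_u / len(phi_u)
    rhs = 2.0 * prior * phi_p.mean(axis=0) - phi_u.mean(axis=0)
    return np.linalg.solve(gram + ridge * np.eye(gram.shape[1]), rhs)

def predict_target(theta, x):
    """Return True where the classifier scores a transition as target-domain."""
    phi = np.column_stack([np.ones(len(x)), x])
    return phi @ theta > 0.0

rng = np.random.default_rng(0)

# Small labeled target-domain set (about 3 percent of all transitions).
labeled_target = make_transitions(300, offset=0.0, rng=rng)

# Large domain-unlabeled set: half target domain, half a shifted domain.
unlabeled = np.vstack([
    make_transitions(5000, offset=0.0, rng=rng),
    make_transitions(5000, offset=1.0, rng=rng),
])
# Ground-truth domain labels, used only to evaluate the filter.
is_target = np.r_[np.ones(5000, dtype=bool), np.zeros(5000, dtype=bool)]

# Stage 1: train the PU domain classifier and filter the unlabeled data.
theta = fit_pu_classifier(labeled_target, unlabeled, prior=0.5)
keep = predict_target(theta, unlabeled)

# Augmented target-domain dataset that stage 2 would consume.
augmented = np.vstack([labeled_target, unlabeled[keep]])

precision = (keep & is_target).sum() / keep.sum()
recall = (keep & is_target).sum() / is_target.sum()
print(f"kept {keep.sum()} transitions, precision={precision:.3f}, recall={recall:.3f}")
```

In the second stage, `augmented` would be passed unchanged to any existing offline RL algorithm, which is what makes the approach plug-and-play. In practice the class prior is usually unknown and would have to be estimated, and a nonlinear classifier would likely be needed for realistic dynamics.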
