Dataset Clustering for Improved Offline Policy Learning

Qiang Wang, Yixin Deng, Francisco Roldan Sanchez, Keru Wang, Kevin McGuinness, Noel O'Connor, Stephen J. Redmond

TL;DR

The paper tackles offline policy learning with multi-behavior datasets by introducing a behavior-aware deep clustering pipeline that partitions data into uni-behavior subsets. It crucially relies on a long-horizon feature, TAAT, to reveal distinct behavioral regions and uses a positive-unlabelled filtering loop to iteratively extract uni-behavior clusters without predefined cluster counts. Empirical results across locomotion and manipulation tasks show near-perfect clustering with an average ARI of $0.987$, and policy learning from clustered subsets can outperform using the full multi-behavior data, demonstrating practical value for data-efficient offline RL. The approach is extensible to ensembles and multi-task settings, though computational cost and terminal-state signaling remain areas for improvement.

Abstract

Offline policy learning aims to discover decision-making policies from previously-collected datasets without additional online interactions with the environment. As the training dataset is fixed, its quality becomes a crucial determining factor in the performance of the learned policy. This paper studies a dataset characteristic that we refer to as multi-behavior, indicating that the dataset is collected using multiple policies that exhibit distinct behaviors. In contrast, a uni-behavior dataset would be collected solely using one policy. We observed that policies learned from a uni-behavior dataset typically outperform those learned from multi-behavior datasets, despite the uni-behavior dataset having fewer examples and less diversity. Therefore, we propose a behavior-aware deep clustering approach that partitions multi-behavior datasets into several uni-behavior subsets, thereby benefiting downstream policy learning. Our approach is flexible and effective; it can adaptively estimate the number of clusters while demonstrating high clustering accuracy, achieving an average Adjusted Rand Index of 0.987 across various continuous control task datasets. Finally, we present improved policy learning examples using dataset clustering and discuss several potential scenarios where our approach might benefit the offline policy learning community.
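
For readers unfamiliar with the Adjusted Rand Index reported above, the following is a minimal sketch of the standard pair-counting formula in plain Python. It is not the authors' evaluation code; the function name `adjusted_rand_index` and the choice to return 1.0 in degenerate cases are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: no pairs to compare (assumed convention).
        return 1.0

    # Contingency counts: items sharing each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate partitions, e.g. one cluster on both sides (assumed convention).
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

The score is invariant to how cluster labels are named, so a clustering that recovers the true behavior groups under different label ids still scores 1.0, while chance-level agreement scores around 0.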

Paper Structure (44 sections, 6 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Distribution of data points in Euclidean space, with and without TAAT, on the Halfcheetah task's multi-behavior dataset. For the left plot, we randomly sample 3,000 action vectors from the entire dataset. For the right plot, we average each trajectory to obtain 3,000 TAAT vectors. We used t-SNE to reduce the dimensionality of the vectors to 3D for visualization.
  • Figure 2: Flowchart of our iterative clustering algorithm. The purple box contains the raw multi-behavior dataset. The blue box represents the extraction of uni-behavior seed subsets for training the subsequent positive-unlabelled (PU) filter. Notably, the blue box extracts only the action sequences from trajectories; the corresponding state sequences must be retrieved from the purple box using the matching trajectory indices. The red box represents one clustering iteration: training the PU filter and using it for clustering, removing the resulting cluster from the original multi-behavior dataset, and checking whether to terminate the clustering process at this stage. Finally, the yellow box shows the resulting uni-behavior clusters.
  • Figure 3: Performance of agents trained using multi-behavior and uni-behavior datasets with different algorithms on various tasks. Each data point on the plots represents evaluation results from 5 episodes, and the scores are normalized using: $\text{score}_{\text{norm}} = \frac{\text{score} - \text{score}_{\min}}{\text{score}_{\max} - \text{score}_{\min}}$.
  • Figure 4: The ratio of $\delta_{\text{same}}/\delta_{\text{diff}}$ across a range of percentile values in ten distinct datasets spanning three different benchmark suites.
  • Figure 5: Mean pairwise Euclidean distances filtered using a percentile threshold of 5%. These distances are computed between actions within the same uni-behavior dataset (represented by the numbers on the diagonal) and between actions from different uni-behavior datasets (represented by the numbers off the diagonal).
  • ...and 5 more figures
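
Two computations mentioned in the captions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the helper names `taat` and `normalize_score` are hypothetical, and reading TAAT as the per-trajectory mean of action vectors is an assumption based only on the Figure 1 caption ("we average each trajectory").

```python
def taat(actions):
    """Average a trajectory's action vectors over time.

    Assumed reading of TAAT from the Figure 1 caption; `actions` is a
    list of equal-length action vectors, one per time step.
    """
    if not actions:
        raise ValueError("trajectory has no actions")
    dim = len(actions[0])
    if any(len(a) != dim for a in actions):
        raise ValueError("all action vectors must have the same dimension")
    steps = len(actions)
    # One mean per action dimension, taken across all time steps.
    return [sum(a[i] for a in actions) / steps for i in range(dim)]


def normalize_score(score, score_min, score_max):
    """Min-max normalization as given in the Figure 3 caption."""
    if score_max == score_min:
        raise ValueError("score_max and score_min must differ")
    return (score - score_min) / (score_max - score_min)
```

For example, a two-step trajectory with actions `[1.0, 2.0]` and `[3.0, 4.0]` yields the vector `[2.0, 3.0]`, so each trajectory contributes a single point regardless of its length, which matches the 3,000 points per plot described for Figure 1.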

Theorems & Definitions (1)

  • Definition 3.1