Just Cluster It: An Approach for Exploration in High-Dimensions using Clustering and Pre-Trained Representations

Stefan Sylvius Wagner; Stefan Harmeling

TL;DR

This work reframes exploration in high-dimensional 3-D environments as a density-estimation problem and proposes Just Cluster It, a two-stage clustering method that performs episodic Gaussian-mixture clustering on both random and pre-trained DINO embeddings, followed by global clustering to accumulate pseudo-counts in a cluster table via a cosine similarity threshold $\kappa$. The approach yields intrinsic rewards based on inverse square roots of pseudo-counts and is demonstrated to outperform several baselines in VizDoom and Habitat, with DINO embeddings providing advantages in visually complex settings while random features can suffice in simpler ones. The key contributions include (i) a practical episodic-plus-global clustering pipeline, (ii) empirical evidence that clustering representations can effectively approximate state-space density in 3-D observations, and (iii) a demonstration that pre-trained biases can enhance exploration in complex environments, along with analyses of the effects of $\kappa$, episodic clustering, and robustness to noise. The work highlights a pathway for leveraging pre-trained representations to guide exploration in sparse-reward, high-dimensional domains and offers a scalable clustering-based alternative to prediction-error-based methods.
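As a rough illustration of the reward described above, the inverse-square-root rule can be written as follows. The symbols are assumptions introduced here, not notation taken from the paper: $c_t$ denotes the global cluster matched by the observation at step $t$, and $\hat{N}(c_t)$ the pseudo-count stored for that cluster in the cluster table.

$$
r^{\text{int}}_t = \frac{1}{\sqrt{\hat{N}(c_t)}}
$$

Under this form, a cluster seen once yields a reward of 1, while a cluster with pseudo-count 100 yields 0.1, so rarely visited regions of the state space are rewarded more strongly.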

Abstract

In this paper we adopt a representation-centric perspective on exploration in reinforcement learning, viewing exploration fundamentally as a density estimation problem. We investigate the effectiveness of clustering representations for exploration in 3-D environments, based on the observation that the importance of pixel changes between transitions is less pronounced in 3-D environments than in 2-D environments, where pixel changes between transitions are typically distinct and significant. We propose a method that performs episodic and global clustering on random representations and on pre-trained DINO representations to count states, i.e., to estimate pseudo-counts. Surprisingly, even random features can be clustered effectively to count states in 3-D environments. However, when these environments become visually more complex, pre-trained DINO representations are more effective thanks to the inductive biases they carry from pre-training. Overall, this presents a pathway for integrating pre-trained biases into exploration. We evaluate our approach on the VizDoom and Habitat environments, demonstrating that our method surpasses other well-known exploration methods in these settings.

Paper Structure (40 sections, 1 equation, 16 figures, 6 tables)

Introduction
The need for state aggregation in 3-D vs. 2-D.
Exploration through Episodic and Global Clustering of High-Dimensional Representations.
Background
Count based methods and estimating pseudo-counts.
Novelty via prediction error.
Density estimation via clustering representations.
Just Cluster It: Exploration via Episodic and Global Clustering
Notation and Preliminaries.
Experiments
Environments.
Experimental setup and baselines.
Evaluation metrics.
VizDoom Results: The Surprising Effectiveness of Random Features
Habitat Results: Pre-trained features help in complex observations
...and 25 more sections

Figures (16)

Figure 1: "Just Cluster It" in a nutshell: (a) Density estimation for exploration in 3-D environments is challenging, because the magnitude of pixel change is large, but the saliency of a single transition is low. This is in contrast to 2-D environments, where every transition is distinct and therefore salient. We therefore propose clustering representations on an episodic and global level to estimate accurate pseudo-counts that reflect the state-space distribution. (b) We show pre-trained representations from a DINO model. The embeddings from the DINO model are able to extract relevant features from the observations, which is useful for clustering. The bottom images are thresholded embeddings. (c) To estimate pseudo-counts, we store episodic cluster centers in a global cluster table over time by matching episodic cluster centers with previously added cluster centers in the cluster table. A new cluster center is added to the table if its cosine similarity to existing cluster centers is below a threshold $\kappa$.
Figure 2: Clustering random features is effective for VizDoom: Our method with clustering outperforms other traditional exploration methods. Interestingly, random features perform slightly better than the DINO features for VizDoom. While the complexity of the observations is not trivial, in this setting the priors in the pre-trained representations seem to be no more informative than random features. This also shows that in 3-D high-dimensional environments random features are salient enough when clustered to estimate pseudo-counts.
Figure 3: Pre-trained DINO features excel with more complex observations: When training on the Habitat environment we see that clustering with DINO embeddings is more effective than clustering with random features. Furthermore, only our method is able to leverage the pre-trained DINO features effectively. We argue that the priors present in the DINO embeddings help in aggregating the representations when building the episodic clusters, which ultimately determine the pseudo-counts.
Figure 4: Ablation studies: We show the effectiveness of clustering, especially episodic clustering: (a) Increasing the cosine similarity threshold $\kappa$ increases the number of global clusters in the cluster table. If the clusters in the cluster table are too granular, performance decreases, as shown by lower visitation counts for higher values of $\kappa$. (b) Episodic clustering helps whenever there is structure in the representations. Especially for DINO in Habitat, performance drops significantly if episodic clustering is removed. This can also be seen in VizDoom (left panels), where episodic clustering helps both DINO and random features. For Habitat (right panels) with random features, episodic clustering has no large effect, since the random features are not expressive enough for the complex observations.
Figure 5: Visualization of DINO clusters for observations with random noise: We plot different samples from the cluster table for the noisy observations in Habitat. Each image represents the mean image of an episodic cluster that was aggregated to the cluster table. The clustering happens on the 384-dimensional embedding. Even with the concatenated random noise the embeddings are still clustered sensibly.
...and 11 more figures
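
The cluster-table update described in the caption of Figure 1(c) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ClusterTable`, the method names, and the choice to increment a matched cluster's pseudo-count by the number of episodic members (`n_members`) are all hypothetical. Only the matching rule (add a new center when the cosine similarity to every stored center is below $\kappa$) and the inverse-square-root reward come from the text.

```python
import numpy as np


class ClusterTable:
    """Hypothetical global cluster table (illustrative names, not the paper's code)."""

    def __init__(self, kappa: float):
        self.kappa = kappa   # cosine similarity threshold
        self.centers = []    # unit-normalised global cluster centers
        self.counts = []     # pseudo-count stored per global cluster

    def update(self, center: np.ndarray, n_members: int = 1) -> int:
        """Match one episodic cluster center against the table and return its index.

        Assumption: a matched cluster's pseudo-count grows by `n_members`.
        """
        norm = np.linalg.norm(center)
        if norm == 0:
            raise ValueError("cluster center must be non-zero")
        c = center / norm
        if self.centers:
            # Centers are unit vectors, so a dot product is the cosine similarity.
            sims = np.stack(self.centers) @ c
            best = int(np.argmax(sims))
            if sims[best] >= self.kappa:
                self.counts[best] += n_members
                return best
        # Similarity to every stored center is below kappa: add a new cluster.
        self.centers.append(c)
        self.counts.append(n_members)
        return len(self.centers) - 1

    def intrinsic_reward(self, idx: int) -> float:
        """Inverse square root of the pseudo-count of cluster `idx`."""
        return float(1.0 / np.sqrt(self.counts[idx]))


table = ClusterTable(kappa=0.9)
a = table.update(np.array([1.0, 0.0]), n_members=3)    # first center, new cluster
b = table.update(np.array([0.99, 0.05]), n_members=1)  # nearly parallel, merged
c = table.update(np.array([0.0, 1.0]), n_members=2)    # orthogonal, new cluster
```

In this sketch a higher `kappa` makes matches rarer, so the table grows faster, which is consistent with the behaviour reported for Figure 4(a).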
