Just Cluster It: An Approach for Exploration in High-Dimensions using Clustering and Pre-Trained Representations
Stefan Sylvius Wagner, Stefan Harmeling
TL;DR
This work reframes exploration in high-dimensional 3-D environments as a density-estimation problem and proposes Just Cluster It, a two-stage clustering method that performs episodic Gaussian-mixture clustering on both random and pre-trained DINO embeddings, followed by global clustering to accumulate pseudo-counts in a cluster table via a cosine similarity threshold $\kappa$. The approach yields intrinsic rewards based on inverse square roots of pseudo-counts and is demonstrated to outperform several baselines in VizDoom and Habitat, with DINO embeddings providing advantages in visually complex settings while random features can suffice in simpler ones. The key contributions include (i) a practical episodic-plus-global clustering pipeline, (ii) empirical evidence that clustering representations can effectively approximate state-space density in 3-D observations, and (iii) a demonstration that pre-trained biases can enhance exploration in complex environments, along with analyses of the effects of $\kappa$, episodic clustering, and robustness to noise. The work highlights a pathway for leveraging pre-trained representations to guide exploration in sparse-reward, high-dimensional domains and offers a scalable clustering-based alternative to prediction-error-based methods.
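The global stage described above (a cluster table, a cosine similarity threshold $\kappa$, and a reward equal to the inverse square root of the pseudo-count) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ClusterTable`, the method `update`, and the choice to keep each centroid fixed at its first member are hypothetical, and the episodic Gaussian-mixture stage is omitted entirely.

```python
import numpy as np


class ClusterTable:
    """Hypothetical global cluster table: unit-norm centroids plus pseudo-counts."""

    def __init__(self, kappa: float):
        self.kappa = kappa      # cosine similarity threshold for joining a cluster
        self.centroids = []     # list of unit-norm embedding vectors
        self.counts = []        # pseudo-count per cluster

    def update(self, embedding) -> float:
        """Assign the embedding to a cluster and return 1 / sqrt(pseudo-count)."""
        z = np.asarray(embedding, dtype=np.float64)
        z = z / (np.linalg.norm(z) + 1e-12)
        if self.centroids:
            # Dot products of unit vectors are cosine similarities.
            sims = np.stack(self.centroids) @ z
            best = int(np.argmax(sims))
            if sims[best] >= self.kappa:
                self.counts[best] += 1
                return 1.0 / np.sqrt(self.counts[best])
        # No sufficiently similar cluster: open a new one (assumed simplification).
        self.centroids.append(z)
        self.counts.append(1)
        return 1.0


table = ClusterTable(kappa=0.9)
r1 = table.update([1.0, 0.0])    # first visit opens a cluster
r2 = table.update([0.99, 0.05])  # nearly the same direction, joins that cluster
r3 = table.update([0.0, 1.0])    # orthogonal direction, opens a second cluster
```

In this sketch a larger $\kappa$ makes clusters narrower, so more embeddings open new clusters and receive the maximal reward; a smaller $\kappa$ merges more embeddings and makes the reward decay faster.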
Abstract
In this paper we adopt a representation-centric perspective on exploration in reinforcement learning, viewing exploration fundamentally as a density estimation problem. We investigate the effectiveness of clustering representations for exploration in 3-D environments, based on the observation that pixel changes between transitions matter less in 3-D environments than in 2-D environments, where such changes are typically distinct and significant. We propose a method that performs episodic and global clustering on random representations and on pre-trained DINO representations to count states, i.e., to estimate pseudo-counts. Surprisingly, even random features can be clustered effectively to count states in 3-D environments; however, when these environments become visually more complex, pre-trained DINO representations are more effective thanks to their pre-trained inductive biases. Overall, this presents a pathway for integrating pre-trained biases into exploration. We evaluate our approach on the VizDoom and Habitat environments, demonstrating that our method surpasses other well-known exploration methods in these settings.
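To give a rough intuition for why random features might still be clusterable, the sketch below embeds synthetic frames with a fixed random linear projection and compares cosine similarities. Everything here is an assumption made for illustration: the abstract does not specify how random representations are produced, so the linear projection `W` is a stand-in, the frames are random arrays rather than VizDoom or Habitat observations, and the names `embed`, `frame`, `next_frame`, and `other_frame` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 32 * 32 * 3, 64

# Fixed, untrained random projection (assumed stand-in for a random encoder).
W = rng.standard_normal((feat_dim, obs_dim)) / np.sqrt(obs_dim)


def embed(obs):
    """Project a frame to a unit-norm random feature vector."""
    z = W @ np.asarray(obs, dtype=np.float64).ravel()
    return z / (np.linalg.norm(z) + 1e-12)


# Synthetic frames: a base frame, a slightly perturbed successor, an unrelated frame.
frame = rng.random((32, 32, 3))
next_frame = np.clip(frame + 0.01 * rng.standard_normal(frame.shape), 0.0, 1.0)
other_frame = rng.random((32, 32, 3))

sim_near = float(embed(frame) @ embed(next_frame))   # small pixel change
sim_far = float(embed(frame) @ embed(other_frame))   # unrelated content
```

Under these assumptions, the slightly perturbed frame stays much closer to the original in embedding space than the unrelated frame does, which is the property a similarity threshold would rely on. Whether this carries over to real 3-D observations is an empirical question that the paper's experiments address.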
