Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering
Liang Zhang, Justin Lieffers, Adarsh Pyarelal
TL;DR
The paper addresses the interpretability gap in deep reinforcement learning (DRL) by using semantic clustering to reveal how states are organized semantically inside the network. It introduces an end-to-end Semantic Clustering Module that combines a Feature Dimensionality Reduction (FDR) network with an online VQ-VAE-based clustering mechanism. The module is integrated into PPO and trained with the total objective $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{DRL}} + \lambda_{\text{ctrl}} ( w_{\text{FDR}} \mathcal{L}_{\text{FDR}} + w_{\text{VQ-VAE}} \mathcal{L}'_{\text{VQ-VAE}} )$, where the auxiliary terms stabilize the low-dimensional mapping and the cluster centroids. The approach yields stable, well-separated semantic clusters in the DRL feature space, supports meaningful cluster descriptions and human evaluation, and provides tools for analyzing hierarchical policy structure without sacrificing performance. Experiments on Procgen environments show improved interpretability and support downstream uses such as behavior summarization and the study of macro-actions. Code is available at https://github.com/ualiangzhang/semantic_rl.
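To make the structure of the total objective concrete, the following is a minimal sketch of how the three loss terms could be combined. It is not taken from the authors' repository: the function name `total_loss`, the argument names, and the default weights are all assumptions chosen for illustration, and the individual losses are passed in as plain numbers.

```python
def total_loss(l_drl, l_fdr, l_vq, lambda_ctrl=1.0, w_fdr=1.0, w_vq=1.0):
    """Combine the DRL loss with the weighted auxiliary clustering losses.

    Mirrors the form L_DRL + lambda_ctrl * (w_FDR * L_FDR + w_VQ-VAE * L'_VQ-VAE).
    All names and default values here are hypothetical.
    """
    auxiliary = w_fdr * l_fdr + w_vq * l_vq   # weighted FDR and VQ-VAE terms
    return l_drl + lambda_ctrl * auxiliary    # lambda_ctrl scales the whole auxiliary part


# Example with made-up loss values: setting lambda_ctrl to 0 recovers the plain DRL loss.
combined = total_loss(1.0, 2.0, 3.0, lambda_ctrl=0.5, w_fdr=1.0, w_vq=2.0)
drl_only = total_loss(1.0, 2.0, 3.0, lambda_ctrl=0.0)
```

In this form a single coefficient controls how strongly the clustering module influences training, while the two weights set the balance between the dimensionality-reduction and clustering terms.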
Abstract
In this paper, we explore semantic clustering properties of deep reinforcement learning (DRL) to improve its interpretability and deepen our understanding of its internal semantic organization. In this context, semantic clustering refers to the ability of neural networks to cluster inputs based on their semantic similarity in the feature space. We propose a DRL architecture that incorporates a novel semantic clustering module that combines feature dimensionality reduction with online clustering. This module integrates seamlessly into the DRL training pipeline, addressing the instability of t-SNE and eliminating the need for extensive manual annotation inherent to prior semantic analysis methods. We experimentally validate the effectiveness of the proposed module and demonstrate its ability to reveal semantic clustering properties within DRL. Furthermore, we introduce new analytical methods based on these properties to provide insights into the hierarchical structure of policies and semantic organization within the feature space. Our code is available at https://github.com/ualiangzhang/semantic_rl.
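As a rough illustration of what clustering in a reduced feature space involves, the sketch below assigns low-dimensional feature vectors to their nearest centroid, which is the basic quantization step in VQ-VAE-style methods. It is an assumption-laden toy, not the proposed module: the function name `assign_clusters`, the 2D features, and the centroid values are hypothetical, and no learning or centroid update is shown.

```python
def assign_clusters(features, centroids):
    """Return, for each feature vector, the index of its nearest centroid.

    Uses squared Euclidean distance. Both arguments are sequences of
    equal-length numeric tuples (hypothetical low-dimensional features).
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


# Made-up 2D features (e.g. the output of a dimensionality-reduction network)
# and two made-up centroids.
example_features = [(0.1, 0.0), (0.9, 1.1), (0.0, 0.2)]
example_centroids = [(0.0, 0.0), (1.0, 1.0)]
example_assignments = assign_clusters(example_features, example_centroids)
```

States whose features fall nearest the same centroid share a cluster index, and such indices are what a semantic analysis of the clusters would then inspect and describe.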
