Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering

Liang Zhang; Justin Lieffers; Adarsh Pyarelal

TL;DR

The paper addresses the interpretability gap in deep reinforcement learning (DRL) by revealing the internal semantic organization of states through semantic clustering. It introduces an end-to-end Semantic Clustering Module that fuses a Feature Dimensionality Reduction (FDR) network with an online VQ-VAE–based clustering mechanism, integrated into PPO, and trains with a total objective $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{DRL}} + \lambda_{\mathrm{ctrl}} ( w_{\mathrm{FDR}} \mathcal{L}_{\mathrm{FDR}} + w_{\mathrm{VQ\text{-}VAE}} \mathcal{L}'_{\mathrm{VQ\text{-}VAE}} )$ to stabilize the low-dimensional mapping and centroids. The approach yields stable, well-separated semantic clusters in the DRL feature space, enables meaningful cluster descriptions and human evaluation, and provides tools for analyzing hierarchical policy structure without sacrificing performance. In experiments on Procgen environments, the method improves interpretability and supports downstream tasks such as behavior summarization and macro-action considerations. Code is available at https://github.com/ualiangzhang/semantic_rl.
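
The total objective above is a weighted sum, so a small sketch may help show how the terms combine. The function name, argument names, and default weights below are illustrative assumptions, not values taken from the paper; the individual loss terms are treated as already-computed scalars.

```python
def total_loss(l_drl, l_fdr, l_vqvae, lam_ctrl=1.0, w_fdr=1.0, w_vqvae=1.0):
    """Combine scalar loss terms following the TL;DR objective.

    l_drl    : DRL (e.g. PPO) loss value
    l_fdr    : FDR loss value
    l_vqvae  : modified VQ-VAE loss value (the primed term in the formula)
    lam_ctrl : global weight on the clustering terms (assumed default)
    w_fdr, w_vqvae : per-term weights (assumed defaults)
    """
    return l_drl + lam_ctrl * (w_fdr * l_fdr + w_vqvae * l_vqvae)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow:
# 0.5 + 0.5 * (2.0 * 0.2 + 1.0 * 0.1) = 0.75
example = total_loss(0.5, 0.2, 0.1, lam_ctrl=0.5, w_fdr=2.0, w_vqvae=1.0)
```

Setting `lam_ctrl` to zero in this sketch recovers the plain DRL loss, which is one way to read the role of $\lambda_{\mathrm{ctrl}}$ in the formula.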

Abstract

In this paper, we explore semantic clustering properties of deep reinforcement learning (DRL) to improve its interpretability and deepen our understanding of its internal semantic organization. In this context, semantic clustering refers to the ability of neural networks to cluster inputs based on their semantic similarity in the feature space. We propose a DRL architecture that incorporates a novel semantic clustering module that combines feature dimensionality reduction with online clustering. This module integrates seamlessly into the DRL training pipeline, addressing the instability of t-SNE and eliminating the need for extensive manual annotation inherent to prior semantic analysis methods. We experimentally validate the effectiveness of the proposed module and demonstrate its ability to reveal semantic clustering properties within DRL. Furthermore, we introduce new analytical methods based on these properties to provide insights into the hierarchical structure of policies and semantic organization within the feature space. Our code is available at https://github.com/ualiangzhang/semantic_rl.

Paper Structure (47 sections, 2 theorems, 15 equations, 18 figures, 7 tables, 1 algorithm)

Introduction
Related Work
VQ-VAE
Method
Background
Semantic Clustering Module
Loss Function Design
Simulations
Clustering Effectiveness Evaluation
Semantic Clustering in DRL
Model and Policy Analysis
Limitations and Future Work
Conclusion
Architecture, Hyperparameters, and Computational Costs
Theoretical Analysis of Loss Design
...and 32 more sections

Key Result

Theorem 1

If $\mathcal{L}_{\mathrm{FDR}}=0$, then there exists a constant $\kappa$ such that, for all $i\neq j$, [equation not recovered in this extract]. Consequently, the ordering of the squared distances $\|y_i-y_j\|^2$ and $\|x_i-x_j\|^2$ is identical. Moreover, if $\kappa=1$, then $d_{\mathrm t}(y_i,y_j)=d_{\mathrm t}(x_i,x_j)$ and hence $\|y_i-y_j\|^2=\|x_i-x_j\|^2$.
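
The ordering claim can be checked numerically under two assumptions that this extract does not confirm: that $d_{\mathrm t}$ is the Student-t kernel $1/(1+\|a-b\|^2)$, and that the theorem's relation has the proportional form $d_{\mathrm t}(y_i,y_j)=\kappa\, d_{\mathrm t}(x_i,x_j)$. The sketch below uses hypothetical helper names and random points; it illustrates the monotonicity argument only and is not the paper's proof.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def student_t_kernel(a, b):
    """Assumed form of d_t: 1 / (1 + ||a - b||^2)."""
    return 1.0 / (1.0 + sq_dist(a, b))


rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]
x_sq = [sq_dist(xs[i], xs[j]) for i, j in pairs]


def implied_y_sq_dists(kappa):
    """Squared distances in y-space implied by d_t(y) = kappa * d_t(x).

    Inverting the assumed kernel gives ||y_i - y_j||^2 = 1 / d_t(y) - 1.
    """
    return [1.0 / (kappa * student_t_kernel(xs[i], xs[j])) - 1.0
            for i, j in pairs]


def ranking(values):
    """Indices of `values` sorted from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)


# Proportional kernels imply the same ordering of squared distances.
same_order = ranking(x_sq) == ranking(implied_y_sq_dists(0.5))

# With kappa = 1 the implied squared distances match the originals.
max_gap = max(abs(a - b) for a, b in zip(x_sq, implied_y_sq_dists(1.0)))
```

Under these assumptions the implied squared distance is $(1+\|x_i-x_j\|^2)/\kappa - 1$, which is strictly increasing in $\|x_i-x_j\|^2$ for any $\kappa>0$; that is why the rankings agree.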

Figures (18)

Figure 1: Overview of our architecture. The upper segment represents the classic DRL training pipeline, while the lower segment introduces the semantic clustering module. The Feature Dimensionality Reduction (FDR) net reduces the dimensionality of state features, resulting in FDR features, which the vector quantizer then processes to generate discrete VQ codes (denoted $k$), which represent states associated with clusters, along with the closest VQ embeddings. Subsequently, $k$ is integrated into the state feature by element-wise addition after being expanded to match the state feature dimensions, enabling conditional policy training that better supports the integration of downstream tasks.
Figure 2: Visualization of features in t-SNE and FDR spaces using PPO and our method. To enable comparison, feature colors in the t-SNE visualizations of our method correspond to the cluster colors in the FDR space, while PPO features are shown in orange due to the absence of clustering. Unlike t-SNE, which fails to produce clearly separable clusters and exhibits sensitivity to the number of states and random seeds, our method yields well-separated and stable clusters under varying conditions.
Figure 3: State examples in the Ninja FDR space and the mean images of clusters. Each dashed box contains a sequence of consecutive states assigned to the same cluster, with dotted arrows indicating their corresponding FDR feature positions. These examples demonstrate that semantically similar and temporally adjacent states are grouped into the same cluster, highlighting the learned semantic coherence. Descriptions of the state sequences in the clusters are provided in \ref{tab:cluster_descriptions}.
Figure 4: Three episodes from the Ninja game. States within colored dashed boxes correspond to clusters of the same colors in \ref{fig:ninja_clusters}. Solid gray arrows indicate omitted intermediate states from the same cluster, while ellipses represent other omitted states. These visualizations illustrate consistent semantic alignment in cluster assignments across different episodes.
Figure 5: Hover examples in the FDR space of Ninja. We show one sub-cluster in the FDR space from a zoomed-out perspective (a) and from zoomed-in perspectives (b), (c), and (d). In each case the agent is standing on the edge of a ledge. Although the scenarios in (b), (c), and (d) differ, the proposed method clusters semantically consistent features together in the FDR space.
...and 13 more figures
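
The quantization and conditioning steps described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names and toy codebook are hypothetical, nearest-neighbor lookup by squared Euclidean distance is the standard VQ-VAE rule, and broadcasting the scalar code $k$ across every state-feature dimension is an assumption about how "expanded to match the state feature dimensions" is realized.

```python
def quantize(fdr_feature, codebook):
    """Return (k, embedding): the index and value of the nearest codebook entry."""
    dists = [sum((f - e) ** 2 for f, e in zip(fdr_feature, emb))
             for emb in codebook]
    k = min(range(len(codebook)), key=dists.__getitem__)
    return k, codebook[k]


def condition_state_feature(state_feature, k):
    """Add the VQ code to the state feature element-wise.

    Assumption: the scalar code k is broadcast to every dimension.
    """
    return [s + float(k) for s in state_feature]


# Toy codebook with three 2-D embeddings (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# The FDR feature [0.9, 1.2] is closest to codebook entry 1.
k, embedding = quantize([0.9, 1.2], codebook)

# State feature conditioned on the cluster code.
conditioned = condition_state_feature([0.1, 0.2, 0.3], k)
```

In this reading, the policy head receives a state feature shifted by the cluster code, so states in different clusters are distinguishable to downstream layers even when their raw features are close.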

Theorems & Definitions (4)

Theorem 1: Similarity Preservation
proof
Theorem 2: Equivalence to Online $k$‑Means
proof
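
For reference, the online $k$-means update named in the title of Theorem 2 can be sketched as below. The extract does not include the theorem statement, so this shows only the standard algorithm (assign each incoming point to its nearest centroid, then move that centroid toward the point with step size $1/n_k$); how the paper's VQ-VAE loss maps onto it is not shown here. All names are illustrative.

```python
def online_kmeans_step(centroids, counts, x):
    """Assign x to its nearest centroid and update that centroid in place.

    With step size 1 / counts[k], each centroid stays equal to the running
    mean of the points assigned to it so far.
    """
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    k = min(range(len(centroids)), key=dists.__getitem__)
    counts[k] += 1
    step = 1.0 / counts[k]
    centroids[k] = [ci + step * (xi - ci) for xi, ci in zip(x, centroids[k])]
    return k


# Two centroids, each initialized from a single point (hence counts of 1).
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]

# Stream three points through the update.
assignments = [online_kmeans_step(centroids, counts, p)
               for p in ([1.0, 1.0], [9.0, 11.0], [2.0, 0.0])]
```

After the stream, the first centroid is the mean of (0, 0), (1, 1), and (2, 0), and the second is the mean of (10, 10) and (9, 11).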
