Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning

Long Chen; Yinkui Liu; Shen Li; Bo Tang; Xuemin Hu

TL;DR

This work tackles the anti-exploration challenge in offline RL by replacing counting over continuous state-action pairs with counting over latent labels produced by a multi-codebook VQVAE, which avoids the dimension explosion and information loss of grid discretization. A codebook update based on fuzzy C-means raises the use rate of codebook vectors, enriching the discretized representation without prohibitive cost. Pseudo-counts derived from codebook-label sequences set a penalty on OOD data, which is integrated into a SAC-style training loop to balance conservatism and exploration. Experiments on D4RL report better performance and lower computational cost than state-of-the-art methods, along with strong OOD detection and supporting ablation studies. The approach combines efficient latent discretization, accurate counting, and principled anti-exploration penalties.
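
As a rough illustration of latent-label counting, the sketch below quantizes a set of latent vectors against several codebooks, uses the resulting label tuple as a count key, and turns the count into a penalty. It makes several assumptions that are not from the paper: the encoder is omitted and one latent vector per codebook is supplied directly, an exact `Counter` stands in for whatever counting structure the authors use, and the penalty shape `beta / sqrt(n + 1)` and all numeric values are hypothetical.

```python
import math
from collections import Counter


def nearest_label(vec, codebook):
    """Index of the codebook vector closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((v - c) ** 2 for v, c in zip(vec, codebook[j])),
    )


def label_sequence(latents, codebooks):
    """One integer label per codebook: latents[i] is quantized by codebooks[i]."""
    return tuple(nearest_label(z, cb) for z, cb in zip(latents, codebooks))


class LabelCounter:
    """Exact counter over label sequences (hypothetical stand-in, see lead-in)."""

    def __init__(self):
        self.counts = Counter()

    def add(self, labels):
        self.counts[labels] += 1

    def penalty(self, labels, beta=1.0):
        # Assumed penalty shape: largest for unseen sequences, shrinking with count.
        return beta / math.sqrt(self.counts[labels] + 1)


# Two toy codebooks with 2 and 3 vectors (illustrative values only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [0.0, 2.0], [3.0, 3.0]],
]
counter = LabelCounter()

seen = label_sequence([[0.1, -0.1], [0.2, 1.8]], codebooks)
for _ in range(3):  # pretend this pair appears three times in the dataset
    counter.add(seen)

unseen = label_sequence([[0.9, 1.2], [2.9, 3.1]], codebooks)
print(seen, counter.penalty(seen))
print(unseen, counter.penalty(unseen))
```

The count key is a short tuple of integers whose length equals the number of codebooks, so its size does not depend on the dimension of the original state-action vector.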

Abstract

Pseudo-counting is an effective anti-exploration technique in offline reinforcement learning (RL): it counts state-action pairs and imposes a large penalty on rare or unseen ones. Existing anti-exploration methods count continuous state-action pairs by discretizing them, but the discretization often suffers from dimension disaster and information loss, which reduces efficiency and performance and can even cause policy learning to fail. In this paper, we propose a novel anti-exploration method for offline RL based on the Vector Quantized Variational Autoencoder (VQVAE) and fuzzy clustering. We first propose an efficient pseudo-count method that uses a multi-codebook VQVAE to discretize state-action pairs, and we design an offline RL anti-exploration method on top of it to handle the dimension disaster issue and improve learning efficiency. In addition, we develop a codebook update mechanism based on fuzzy C-means (FCM) clustering to improve the use rate of codebook vectors, addressing the information loss in the discretization process. We evaluate the proposed method on the Datasets for Deep Data-Driven Reinforcement Learning (D4RL) benchmark; experimental results show that it performs better and requires less computing cost than state-of-the-art (SOTA) methods on multiple complex tasks.
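
The dimension disaster of grid discretization can be made concrete with a little arithmetic. The sketch below compares the number of cells in a uniform grid with the number of distinct label sequences a multi-codebook quantizer can emit. The 23-dimensional input, the bin counts, and the 4 codebooks of 64 vectors are hypothetical values chosen for illustration, not settings from the paper.

```python
def grid_cells(dim, bins_per_dim):
    """Cells in a uniform grid that splits every input dimension into bins_per_dim bins."""
    return bins_per_dim ** dim


def label_keys(vectors_per_codebook, num_codebooks):
    """Distinct label sequences when one label is drawn from each codebook."""
    return vectors_per_codebook ** num_codebooks


# Hypothetical 23-dimensional state-action vector.
for bins in (2, 10):
    print(f"grid with {bins} bins per dim: {grid_cells(23, bins)} cells")

# Hypothetical quantizer: 4 codebooks with 64 vectors each.
print(f"label sequences: {label_keys(64, 4)} keys, each of length 4")
```

Under these assumed numbers, the grid key has 23 coordinates and the cell count grows exponentially in the input dimension, while the label sequence stays at 4 integers regardless of how many dimensions the state-action pair has.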

Paper Structure (21 sections, 16 equations, 9 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Anti-exploration Methods in Offline RL
Pseudo-count Methods
Preliminaries
Anti-Exploration Based on Pseudo-count
VQVAE
Counting Bloom Filter
Methodology
Pseudo-counting Based on Multi-codebook VQVAE
Anti-exploration Based on Pseudo-Count
Codebook Update Based on FCM Clustering
Experiments
Experimental Settings
Metrics and Baselines
...and 6 more sections

Figures (9)

Figure 1: Comparison of the dimensions after discretization between the proposed method and traditional methods. Traditional pseudo-count methods discretize the continuous space into grids for counting, where the number of grid cells grows rapidly with the input state-action dimension and the discretization level, leading to dimension disaster. We propose a multi-codebook VQVAE-based pseudo-counting method that discretizes continuous state-action pairs into discrete vectors, and develop an FCM-based method to update the codebook vectors, effectively reducing the data dimension and information loss after discretization.
Figure 2: Pseudo-counting based on multi-codebook VQVAE. During quantization, a vector is selected from each codebook and its label is obtained, forming a one-dimensional sequence of integer labels. Pseudo-counting is then performed on this label sequence to calculate the corresponding penalty value $p(s, a)$.
Figure 3: Update process of FCM clustering. The membership degrees are calculated based on the Euclidean distances between the latent vectors and all the codebook vectors, and serve as the weights of corresponding codebook vectors in the update process, enabling joint optimization of all codebook vectors.
Figure 4: Experimental results with different codebook numbers
Figure 5: Use rates of codebook vectors with and without FCM.
...and 4 more figures
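
The membership-weighted update described in the Figure 3 caption can be sketched with the textbook fuzzy C-means rules. This is a minimal sketch under stated assumptions, not the authors' update: it uses the standard FCM membership and center formulas with a fuzzifier `m = 2`, a small `eps` to avoid division by zero, and toy one-dimensional latents; how the paper combines this step with VQVAE training is not shown here.

```python
import math


def memberships(z, codebook, m=2.0, eps=1e-12):
    """Fuzzy C-means membership of latent z in every codebook vector (sums to 1)."""
    dists = [math.dist(z, c) + eps for c in codebook]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


def fcm_update(latents, codebook, m=2.0):
    """One FCM step: each codebook vector becomes a membership-weighted mean of all latents."""
    u = [memberships(z, codebook, m) for z in latents]
    new_codebook = []
    for j in range(len(codebook)):
        weights = [u_i[j] ** m for u_i in u]
        total = sum(weights)
        dim = len(codebook[j])
        new_codebook.append(
            [sum(w * z[d] for w, z in zip(weights, latents)) / total for d in range(dim)]
        )
    return new_codebook


# Toy data: the second codebook vector starts far from every latent.
latents = [[0.0], [1.0]]
codebook = [[0.5], [10.0]]
updated = fcm_update(latents, codebook)
print(memberships(latents[0], codebook))
print(updated)
```

Because every latent contributes to every codebook vector with a nonzero weight, a vector that would never win a nearest-neighbor lookup (the one at 10.0 in this toy setup) is still pulled toward the data, which is the intuition behind a higher use rate.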
