Frugal Actor-Critic: Sample Efficient Off-Policy Deep Reinforcement Learning Using Unique Experiences

Nikhil Kumar Singh; Indranil Saha

TL;DR

FAC introduces a principled method to improve sample efficiency in off-policy actor-critic RL by preserving unique experiences in the replay buffer. It combines a QR-based selection of significant state dimensions, nonuniform state-space partitioning into abstract states, and KDE-based density estimation to insert only novel state–reward experiences, achieving IID-like sampling with a smaller buffer. Theoretical analysis shows faster convergence and improved IID properties, while empirical results across nine Gym benchmarks demonstrate substantial buffer-size reductions and often higher rewards than strong baselines like SAC and TD3, as well as outperforming LABER in most cases. The approach is computationally light and broadly compatible with existing off-policy algorithms, promising practical gains for memory-constrained, real-time RL systems.
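The QR-based selection of significant state dimensions mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a greedy column-pivoted Gram-Schmidt factorization over mean-centred states collected during random exploration, and the names `pivoted_qr_order`, `significant_dims`, and the relative tolerance `rel_tol` are hypothetical.

```python
import math

def pivoted_qr_order(rows):
    """Greedy column-pivoted Gram-Schmidt on mean-centred columns.

    Returns (pivot order, residual norms), where the residual norms play
    the role of |R_kk| in a column-pivoted QR factorization.
    Mean-centring is an assumption of this sketch.
    """
    n_dims = len(rows[0])
    cols = []
    for j in range(n_dims):
        col = [r[j] for r in rows]
        mean = sum(col) / len(col)
        cols.append([v - mean for v in col])
    remaining = list(range(n_dims))
    order, diag = [], []
    while remaining:
        norms = {j: math.sqrt(sum(v * v for v in cols[j])) for j in remaining}
        pivot = max(remaining, key=lambda j: norms[j])
        order.append(pivot)
        diag.append(norms[pivot])
        remaining.remove(pivot)
        if norms[pivot] == 0.0:
            # Everything left carries no residual information.
            for j in remaining:
                order.append(j)
                diag.append(0.0)
            break
        q = [v / norms[pivot] for v in cols[pivot]]
        # Remove the pivot direction from every remaining column.
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, cols[j]))
            cols[j] = [b - proj * a for a, b in zip(q, cols[j])]
    return order, diag

def significant_dims(rows, rel_tol=0.1):
    """Keep dimensions whose residual norm is at least rel_tol times the largest."""
    order, diag = pivoted_qr_order(rows)
    if diag[0] == 0.0:
        return []
    return sorted(j for j, d in zip(order, diag) if d / diag[0] >= rel_tol)

# Dimension 1 is twice dimension 0 (redundant); dimension 2 is constant.
states = [[1, 2, 0.0, 1], [2, 4, 0.0, -1], [3, 6, 0.0, 1], [4, 8, 0.0, -1]]
selected = significant_dims(states, rel_tol=0.1)
```

In this toy data the redundant and constant dimensions are dropped, which is the intuition behind partitioning the state space on a reduced set of variables.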

Abstract

Efficient utilization of the replay buffer plays a significant role in the off-policy actor-critic reinforcement learning (RL) algorithms used for model-free control policy synthesis for complex dynamical systems. We propose a method for achieving sample efficiency that selects unique samples and adds them to the replay buffer during exploration, with the goal of reducing the buffer size while maintaining the independent and identically distributed (IID) nature of the samples. Our method has three steps: selecting an important subset of the state variables from the experiences encountered during the initial phase of random exploration, partitioning the state space into a set of abstract states based on the selected state variables, and finally selecting the experiences with a unique state-reward combination by using a kernel density estimator. We formally prove that the off-policy actor-critic algorithm incorporating the proposed method for unique experience accumulation converges faster than the vanilla off-policy actor-critic algorithm. Furthermore, we evaluate our method by comparing it with two state-of-the-art actor-critic RL algorithms on several continuous control benchmarks available in the Gym environment. Experimental results demonstrate that our method achieves a significant reduction in the size of the replay buffer for all the benchmarks while achieving either faster convergence or better reward accumulation compared to the baseline algorithms.
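The final step, inserting only experiences with a novel state-reward combination, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's algorithm: it uses uniform bins instead of the nonuniform partitioning, a one-dimensional Gaussian kernel over the rewards already stored for each abstract state, and a fixed density threshold. The names `UniqueExperienceBuffer`, `bin_width`, `bandwidth`, and `max_density` are hypothetical.

```python
import math
from collections import defaultdict

def abstract_state(state, dims, bin_width):
    """Map a concrete state to a cell index over the selected dimensions.

    Uniform binning is an assumption of this sketch.
    """
    return tuple(math.floor(state[d] / bin_width) for d in dims)

def reward_density(reward, seen_rewards, bandwidth):
    """Gaussian kernel density estimate of `reward` given stored rewards."""
    if not seen_rewards:
        return 0.0
    norm = len(seen_rewards) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(
        math.exp(-0.5 * ((reward - r) / bandwidth) ** 2) for r in seen_rewards
    )
    return total / norm

class UniqueExperienceBuffer:
    """Replay buffer that rejects experiences whose reward is already
    densely represented within the same abstract state."""

    def __init__(self, dims, bin_width=1.0, bandwidth=0.1, max_density=1.0):
        self.dims = dims
        self.bin_width = bin_width
        self.bandwidth = bandwidth
        self.max_density = max_density
        self.rewards_by_cell = defaultdict(list)
        self.buffer = []

    def add(self, state, action, reward, next_state):
        """Insert the experience if it is novel; return whether it was kept."""
        cell = abstract_state(state, self.dims, self.bin_width)
        density = reward_density(reward, self.rewards_by_cell[cell], self.bandwidth)
        if density >= self.max_density:
            return False  # an equivalent experience is already stored
        self.rewards_by_cell[cell].append(reward)
        self.buffer.append((state, action, reward, next_state))
        return True

buf = UniqueExperienceBuffer(dims=[0], bin_width=1.0, bandwidth=0.1, max_density=1.0)
kept = [
    buf.add([0.2, 5.0], 0, 1.0, [0.3, 5.0]),    # first sample in its cell
    buf.add([0.3, -3.0], 0, 1.0, [0.4, -3.0]),  # same cell, same reward
    buf.add([0.3, -3.0], 0, 2.0, [0.4, -3.0]),  # same cell, new reward
    buf.add([1.5, 0.0], 0, 1.0, [1.6, 0.0]),    # different cell
]
```

Because the second experience falls in the same cell with an already-seen reward, it is skipped, so the buffer grows only when something new is observed.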

Paper Structure (26 sections, 2 theorems, 15 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Problem
Preliminaries
Reinforcement Learning.
Deep RL Algorithms with Experience Replay.
Problem Definition
The FAC Algorithm
Finding Significant State Dimensions
State-Space Partitioning
Density Estimation
Learning from Replay Buffers.
Computation Overhead
Theoretical Analysis
Evaluation
Experimental Setup
...and 11 more sections

Key Result

Theorem 1

Let $J(\theta,x)$ be an $L$-smooth convex cost function in policy parameter $\theta$ and $x \in \mathcal{R}$. Assume that the gradient $\nabla_{\theta} J(\theta,x)$ has $\sigma^2$-bounded variance for all $\theta$. Let $\zeta$ represent the number of equivalent samples in a mini-batch of size $b$ ...

Figures (8)

Figure 1: Plots comparing the Epanechnikov (dark blue), tophat (dark red), and Gaussian (sky blue) kernels.
Figure 2: Plots showing training for different values of $\epsilon$: 0.1 (pink), 0.2 (dark blue), 0.3 (green), and 0.5 (light blue).
Figure 3: Plots showing training for different values of $\eta$: 10k (light blue), 50k (pink), 100k (dark blue), and 150k (green).
Figure 4: Plots showing training for different values of $\beta$: 0.1 (green), 0.2 (dark blue), 0.3 (gray), and 0.5 (light blue).
Figure 5: Plots showing training for different values of $\mu$: 30 (green), 50 (dark blue), 70 (orange), and 100 (gray).
...and 3 more figures

Theorems & Definitions (3)

Definition 1: Equivalent Experience
Theorem 1: Convergence
Theorem 2
