Deep RL With Information Constrained Policies: Generalization in Continuous Control
Tailia Malloy, Chris R. Sims, Tim Klinger, Miao Liu, Matthew Riemer, Gerald Tesauro
TL;DR
This work introduces Capacity Limited RL (CLRL), a framework that regularizes policy complexity with information-theoretic constraints to improve generalization in continuous control. By casting the policy as an information channel and penalizing mutual information via a coefficient $\beta$, the approach yields a capacity-limited objective $J(\pi)=\mathbb{E}[r(s_t,a_t) - \beta \mathcal{I}(\pi(\cdot|s_t))]$, promoting simpler, more transferable policies. The Capacity-Limited Actor-Critic (CLAC) algorithm operationalizes this idea by adapting SAC with a capacity-limited value function and MI-based policy updates, and by optionally auto-tuning $\beta$ to meet a target information capacity. Empirical results on Continuous N-Chain and robust robot control tasks show improved generalization under environment perturbations without sacrificing sample efficiency, highlighting potential practical benefits for real-world robotics. The work situates CLAC within MERL, KL-RL, and MIRL families, clarifying its unique emphasis on final policy simplicity and transferability under information constraints.
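To make the objective concrete, the following is a minimal sketch for a tabular, one-step setting. It is an illustration, not the CLAC algorithm, which targets continuous control with function approximation. Two assumptions are made here that the text above does not state: the per-state information term $\mathcal{I}(\pi(\cdot|s))$ is taken to be the KL divergence between $\pi(\cdot|s)$ and the marginal action distribution, and states are drawn from a fixed distribution. All function and variable names are hypothetical.

```python
import math

def marginal_action_probs(state_probs, policy):
    """Marginal p(a) = sum_s p(s) * pi(a|s)."""
    n_actions = len(policy[0])
    return [sum(p_s * pi_s[a] for p_s, pi_s in zip(state_probs, policy))
            for a in range(n_actions)]

def per_state_information(pi_s, p_a):
    """Assumed form of I(pi(.|s)): KL(pi(.|s) || p(.)) in nats, with 0*log(0) = 0."""
    return sum(q * math.log(q / p) for q, p in zip(pi_s, p_a) if q > 0.0)

def capacity_limited_objective(state_probs, policy, rewards, beta):
    """One-step J(pi) = E[ r(s, a) - beta * I(pi(.|s)) ] for tabular inputs."""
    p_a = marginal_action_probs(state_probs, policy)
    total = 0.0
    for p_s, pi_s, r_s in zip(state_probs, policy, rewards):
        if p_s == 0.0:
            continue  # states that never occur contribute nothing
        expected_reward = sum(q * r for q, r in zip(pi_s, r_s))
        total += p_s * (expected_reward - beta * per_state_information(pi_s, p_a))
    return total

# Toy problem (hypothetical): two equally likely states, two actions,
# reward 1 when the action index matches the state index.
state_probs = [0.5, 0.5]
rewards = [[1.0, 0.0], [0.0, 1.0]]
state_dependent = [[1.0, 0.0], [0.0, 1.0]]    # full reward, log(2) nats of information
state_independent = [[0.5, 0.5], [0.5, 0.5]]  # half the reward, zero information

for beta in (0.1, 1.0):
    j_dep = capacity_limited_objective(state_probs, state_dependent, rewards, beta)
    j_ind = capacity_limited_objective(state_probs, state_independent, rewards, beta)
    print(f"beta={beta}: state-dependent={j_dep:.3f}, state-independent={j_ind:.3f}")
```

In this toy problem a small $\beta$ favors the state-dependent policy, while a large $\beta$ favors the policy that ignores the state, which is the trade-off between reward and policy complexity that the objective encodes.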
Abstract
Biological agents learn and act intelligently in spite of a highly limited capacity to process and store information. Many real-world problems involve continuous control, which represents a difficult task for artificial intelligence agents. In this paper we explore the potential learning advantages that a natural constraint on information flow might confer on artificial agents in continuous control tasks. We focus on the model-free reinforcement learning (RL) setting and formalize our approach in terms of an information-theoretic constraint on the complexity of learned policies. We show that our approach emerges in a principled fashion from the application of rate-distortion theory. We implement a novel Capacity-Limited Actor-Critic (CLAC) algorithm and situate it within a broader family of RL algorithms that includes the Soft Actor Critic (SAC) and Mutual Information Reinforcement Learning (MIRL) algorithms. Our experiments using continuous control tasks show that, compared to alternative approaches, CLAC offers improvements in generalization between training and modified test environments. CLAC achieves this while retaining the high sample efficiency of similar methods.
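As a rough sketch of how a rate-distortion view can lead to a penalized objective of this kind, consider a simplified one-step setting. The notation $p(s)$, $p(a)$, $R_{\min}$, and $\lambda$ is assumed here for illustration and is not taken from the text above. Rate-distortion theory seeks the channel with the lowest information rate subject to a bound on expected distortion. Treating the policy as a channel from states to actions, and a lower bound on expected reward as the distortion constraint, gives

$$\min_{\pi} \; \mathcal{I}(S;A) \quad \text{subject to} \quad \mathbb{E}_{p(s)\,\pi(a|s)}[r(s,a)] \ge R_{\min},$$

where

$$\mathcal{I}(S;A) = \mathbb{E}_{p(s)}\left[ D_{\mathrm{KL}}\big(\pi(\cdot|s) \,\|\, p(\cdot)\big) \right], \qquad p(a) = \sum_{s} p(s)\,\pi(a|s).$$

Introducing a Lagrange multiplier $\lambda > 0$ for the reward constraint and setting $\beta = 1/\lambda$, the constrained problem shares its optimal policies with the unconstrained one

$$\max_{\pi} \; \mathbb{E}_{p(s)\,\pi(a|s)}[r(s,a)] - \beta\, \mathcal{I}(S;A),$$

which has the same reward-minus-information form as a capacity-limited objective, with larger $\beta$ corresponding to a tighter limit on policy complexity.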
