Deep RL With Information Constrained Policies: Generalization in Continuous Control

Tailia Malloy, Chris R. Sims, Tim Klinger, Miao Liu, Matthew Riemer, Gerald Tesauro

TL;DR

This work introduces Capacity Limited RL (CLRL), a framework that regularizes policy complexity with information-theoretic constraints to improve generalization in continuous control. By casting the policy as an information channel and penalizing mutual information via a coefficient $\beta$, the approach yields a capacity-limited objective $J(\pi)=\mathbb{E}[r(s_t,a_t) - \beta \mathcal{I}(\pi(\cdot|s_t))]$, promoting simpler, more transferable policies. The Capacity-Limited Actor-Critic (CLAC) algorithm operationalizes this idea by adapting SAC with a capacity-limited value function and MI-based policy updates, and by optionally auto-tuning $\beta$ to meet a target information capacity. Empirical results on Continuous N-Chain and robust robot control tasks show improved generalization under environment perturbations without sacrificing sample efficiency, highlighting potential practical benefits for real-world robotics. The work situates CLAC within MERL, KL-RL, and MIRL families, clarifying its unique emphasis on final policy simplicity and transferability under information constraints.
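
To make the capacity-limited objective concrete, the following is a minimal sketch, not the CLAC implementation. It assumes a tiny tabular, one-step setting (two states, two actions) in which the mutual information between states and actions can be computed exactly. The function names, the reward table, and the state distribution are all hypothetical and chosen only for illustration.

```python
import math


def mutual_information(state_probs, policy):
    """Exact I(S; A) in nats for a tabular policy, where policy[s][a] = pi(a | s)."""
    n_actions = len(policy[0])
    # Marginal action distribution: p(a) = sum_s p(s) * pi(a | s)
    marginal = [
        sum(p_s * row[a] for p_s, row in zip(state_probs, policy))
        for a in range(n_actions)
    ]
    mi = 0.0
    for p_s, row in zip(state_probs, policy):
        for a, p_a_given_s in enumerate(row):
            # Terms with zero probability contribute nothing to the sum.
            if p_s > 0.0 and p_a_given_s > 0.0:
                mi += p_s * p_a_given_s * math.log(p_a_given_s / marginal[a])
    return mi


def capacity_limited_objective(state_probs, policy, rewards, beta):
    """One-step analogue of J(pi) = E[r(s, a)] - beta * I(S; A)."""
    expected_reward = sum(
        p_s * p_a * rewards[s][a]
        for s, (p_s, row) in enumerate(zip(state_probs, policy))
        for a, p_a in enumerate(row)
    )
    return expected_reward - beta * mutual_information(state_probs, policy)


# Hypothetical toy problem: the rewarded action differs between the two states.
state_probs = [0.5, 0.5]
rewards = [[1.0, 0.0], [0.0, 1.0]]
state_dependent = [[1.0, 0.0], [0.0, 1.0]]  # uses one bit of state information
state_agnostic = [[0.5, 0.5], [0.5, 0.5]]   # ignores the state entirely

for beta in (0.5, 1.0):
    j_dep = capacity_limited_objective(state_probs, state_dependent, rewards, beta)
    j_agn = capacity_limited_objective(state_probs, state_agnostic, rewards, beta)
    print(f"beta={beta}: state-dependent J={j_dep:.4f}, state-agnostic J={j_agn:.4f}")
```

In this toy setting the state-dependent policy scores higher at $\beta = 0.5$, while the state-agnostic policy scores higher at $\beta = 1.0$, which illustrates how a larger $\beta$ trades reward for a simpler policy. In continuous control the mutual information term generally has to be estimated rather than computed exactly.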

Abstract

Biological agents learn and act intelligently in spite of a highly limited capacity to process and store information. Many real-world problems involve continuous control, which represents a difficult task for artificial intelligence agents. In this paper we explore the potential learning advantages a natural constraint on information flow might confer onto artificial agents in continuous control tasks. We focus on the model-free reinforcement learning (RL) setting and formalize our approach in terms of an information-theoretic constraint on the complexity of learned policies. We show that our approach emerges in a principled fashion from the application of rate-distortion theory. We implement a novel Capacity-Limited Actor-Critic (CLAC) algorithm and situate it within a broader family of RL algorithms such as the Soft Actor Critic (SAC) and Mutual Information Reinforcement Learning (MIRL) algorithm. Our experiments using continuous control tasks show that compared to alternative approaches, CLAC offers improvements in generalization between training and modified test environments. This is achieved in the CLAC model while displaying the high sample efficiency of similar methods.

Paper Structure

This paper contains 24 sections, 34 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Average reward by time step from 48 agents trained on 5 re-samplings of 50K steps through the N-chain environment with the number of states as n=5 and a Beta(10,25) distribution sampling hidden values. Shaded regions represent standard deviation. Hidden values are re-sampled after each set of 50K steps. All models used the same initial coefficient 0.5.
  • Figure 2: Top: Ant walker controller task. Bottom: Double Pendulum balancing task. Left: Non-randomized training results of 1M (Ant) and 50K (Pendulum) time steps with 8 walker and 20 pendulum agents. Middle: Generalization results with environment parameters re-sampled from a uniform [95-105%] 100 or 50 times. Right: Generalization results with environment parameters re-sampled from a disjoint set [90-95%] and [105-110%] 100 or 50 times. Error bars represent standard deviation. Best performing coefficients for the non-randomized task (left) sampled from [0-2.0] in 0.05 windows and reused for all tests.
  • Figure 3: Top: Diagram of the Continuous N-Chain Learning environment. Middle: Example of a set of hidden state values. Right: Graph of the probability of moving to the next state given the absolute distance from the hidden state value and action performed. Bottom: Function describing this state transition probability.