Maximum Total Correlation Reinforcement Learning

Bang You, Puze Liu, Huaping Liu, Jan Peters, Oleg Arenz

TL;DR

This work addresses brittleness in reinforcement learning by introducing a trajectory-level information bias: maximizing the total correlation within induced state-action sequences. The authors formulate Maximum Total Correlation RL (MTC-RL), derive a variational lower bound to enable practical optimization, and implement it atop Soft Actor-Critic with an adaptive information-weighting coefficient. Empirically, MTC-RL yields more compressible and predictable trajectories, improving robustness to observation and action noise as well as dynamics changes, while maintaining or improving task performance across locomotion, manipulation, and image-based control benchmarks. The results suggest trajectory-level regularization as a principled, generalizable approach to enhance robustness and generalization in continuous-control agents.

Abstract

Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all with the underlying motivation of increasing generalizability and robustness by focusing on the essentials. Complementing these techniques, we investigate how to promote simple behavior throughout the episode. To that end, we introduce a modification of the reinforcement learning problem that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower-bound approximation. In simulated robot environments, our method naturally generates policies that induce periodic and compressible trajectories and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.
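
For readers unfamiliar with the term, the block below recalls the standard definition of total correlation for a sequence of random variables, followed by one plausible shape of the augmented objective. The definition is textbook material. The second expression is only a sketch: the weight \alpha, the reward symbol r_t, and the choice of which per-step variables X_t enter the total correlation (for example states, latent states, or actions) are assumptions, since this summary does not state them.

```latex
% Total correlation of random variables X_1, ..., X_T (standard definition).
C(X_1, \dots, X_T)
  = \sum_{t=1}^{T} H(X_t) - H(X_1, \dots, X_T)
  = D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_T) \,\Big\|\, \prod_{t=1}^{T} p(x_t) \right)

% Hypothetical form of the augmented objective.
% \alpha, r_t and the choice of X_t are assumptions, not taken from the paper.
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} r_t \right] + \alpha \, C(X_1, \dots, X_T), \qquad \alpha > 0
```

The total correlation is zero exactly when the variables are independent and grows as the steps of a trajectory become more predictable from one another, which is consistent with the periodic, compressible behavior described above.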

Paper Structure

This paper contains 45 sections, 23 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: We evaluated the robustness towards observation noise (left), action noise (middle) and mass changes (right) on eight tasks from the DMC benchmarks. The plots show the normalized mean rewards averaged over 20 independent runs and 8 tasks, with error bars representing the 90% confidence interval. For each task, we normalized the return by the mean return of the best method. Each run includes 30 evaluation trajectories. MTC achieves better aggregated performance than the baselines in the presence of perturbations to observations and actions, while also obtaining higher mean rewards when the body mass is changed slightly.
  • Figure 2: The compressed state-action trajectories obtained by MTC have the smallest file size in expectation.
  • Figure 3: Left: aggregated performance of MTC and baselines at 500K environment steps on six image-based DMC tasks. The plot shows the normalized average rewards over 5 runs and 6 tasks, with error bars representing the 90% confidence interval. For each run, we collect 10 evaluation episodes. MTC achieves better performance than the baselines. Right: performance of our method and baselines on three manipulation tasks from Metaworld. The curves represent the average success rate over 10 different runs, with the 90% confidence interval. Each run collects 10 evaluation episodes. MTC is competitive with the baselines.
  • Figure 4: We test the robustness of MTC and its two ablations, MTC-NoA and SAC, on the Walker Stand task and the Cheetah Run task. Overall, MTC achieves better or at least comparable average rewards in the presence of observation noise (left column), action noise (middle column), and mass changes (right column) than its ablations.
  • Figure 5: We evaluate our method on eight image-based DMC tasks.
  • ...and 7 more figures
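
The compressibility comparison in Figure 2 can be illustrated with a minimal sketch. This is not the authors' evaluation code: the 8-bit quantization, the use of zlib, the helper name `compressed_size`, and the two synthetic trajectories are all assumptions chosen for illustration. The sketch only shows the general idea that a periodic trajectory compresses to a much smaller size than an unstructured one.

```python
import math
import random
import zlib


def compressed_size(trajectory):
    """Return the zlib-compressed size in bytes of a trajectory.

    `trajectory` is a sequence of steps, each step a sequence of floats
    assumed to lie in [-1, 1]. Each value is quantized to one byte
    (an illustrative choice, not taken from the paper).
    """
    quantized = bytearray()
    for step in trajectory:
        for value in step:
            clipped = max(-1.0, min(1.0, value))
            quantized.append(round((clipped + 1.0) / 2.0 * 255))
    return len(zlib.compress(bytes(quantized), 9))


# Synthetic periodic trajectory: two dimensions, period of 50 steps.
periodic = [
    (math.sin(2 * math.pi * t / 50), math.cos(2 * math.pi * t / 50))
    for t in range(1000)
]

# Synthetic unstructured trajectory: independent uniform noise per step.
rng = random.Random(0)
noisy = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]

print("periodic:", compressed_size(periodic), "bytes")
print("noisy:   ", compressed_size(noisy), "bytes")
```

In practice, one would apply the same measurement to state-action trajectories collected from each trained policy and average the resulting sizes across evaluation episodes.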