Value constrained model-free continuous control

Steven Bohez; Abbas Abdolmaleki; Michael Neunert; Jonas Buchli; Nicolas Heess; Raia Hadsell

TL;DR

The paper tackles undesirable bang-bang control in continuous-action RL with a constraint-based framework that uses Lagrangian relaxation to minimize auxiliary costs while ensuring task success. It formalizes the problem as a constrained MDP, derives a learnable dual with state-dependent multipliers, and extends the constraints to per-state and conditional forms, which enables dynamic reward-cost trade-offs and exploration of the Pareto front. The approach is instantiated off-policy with MPO and Retrace and validated on DM Control benchmarks, a Minitaur locomotion task, and a reaching task on a real Sawyer arm, where it yields smoother, more energy-efficient behavior without sacrificing task performance. The result is a practical, scalable method for real-world continuous control that adapts to multiple tasks and constraints without manual tuning of reward-cost weights.

Abstract

The naive application of Reinforcement Learning algorithms to continuous control problems, such as locomotion and manipulation, often results in policies that rely on high-amplitude, high-frequency control signals, known colloquially as bang-bang control. Although such solutions may indeed maximize task reward, they can be unsuitable for real-world systems. Bang-bang control may lead to increased wear and tear or energy consumption, and tends to excite undesired second-order dynamics. To counteract this issue, multi-objective optimization can be used to simultaneously optimize both the reward and some auxiliary cost that discourages undesired (e.g. high-amplitude) control. In principle, such an approach can yield the sought-after smooth control policies. It can, however, be hard to find the trade-off between cost and return that results in the desired behavior. In this paper we propose a new constraint-based reinforcement learning approach that ensures task success while minimizing one or more auxiliary costs (such as control effort). We employ Lagrangian relaxation to learn both (a) the parameters of a control policy that satisfies the desired constraints and (b) the Lagrangian multipliers for the optimization. Moreover, we demonstrate that we can satisfy constraints either in expectation or in a per-step fashion, and can even learn a single policy that is able to dynamically trade off between return and cost. We demonstrate the efficacy of our approach on a number of continuous control benchmark tasks, a realistic, energy-optimized quadruped locomotion task, and a reaching task on a real robot arm.
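The idea of learning policy parameters and Lagrangian multipliers together can be illustrated with a toy problem. The sketch below is not the paper's algorithm (which, per the TL;DR, builds on MPO and Retrace). It assumes a hypothetical one-dimensional "policy" consisting of a single action value `a`, a made-up reward `1 - (a - 1)^2`, a made-up cost `a^2`, and a reward threshold of 0.75. It minimizes the cost subject to the reward staying above the threshold by gradient descent on `a` and projected gradient ascent on the multiplier `lam`.

```python
# Toy Lagrangian relaxation: minimize cost(a) subject to reward(a) >= threshold.
# All functions and constants here are illustrative assumptions, not from the paper.

def reward(a):
    # Hypothetical task reward, maximal at a = 1.
    return 1.0 - (a - 1.0) ** 2

def cost(a):
    # Hypothetical auxiliary cost (e.g. control effort), minimal at a = 0.
    return a ** 2

def solve(threshold=0.75, lr=0.05, steps=2000):
    a, lam = 1.0, 0.0  # start at the reward-maximizing action, multiplier at zero
    for _ in range(steps):
        # Lagrangian: L(a, lam) = cost(a) - lam * (reward(a) - threshold)
        grad_a = 2.0 * a + 2.0 * lam * (a - 1.0)   # dL/da
        grad_lam = -(reward(a) - threshold)         # dL/dlam
        a -= lr * grad_a                            # descent on the primal variable
        lam = max(0.0, lam + lr * grad_lam)         # projected ascent keeps lam >= 0
    return a, lam

if __name__ == "__main__":
    a, lam = solve()
    print(f"a={a:.4f} lam={lam:.4f} reward={reward(a):.4f} cost={cost(a):.4f}")
```

In this toy setting the multiplier grows only while the reward constraint is violated, so the trade-off weight is found automatically instead of being tuned by hand. Projecting the multiplier with `max(0.0, ...)` is one simple way to keep it non-negative; other parameterizations are possible.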
