Value constrained model-free continuous control

Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, Raia Hadsell

TL;DR

The paper tackles the problem of undesirable bang-bang control in continuous-action RL by introducing a constraint-based framework that uses Lagrangian relaxation to minimize auxiliary costs while guaranteeing task success. It formalizes constrained MDPs, derives a learnable dual with state-dependent multipliers, and extends constraints to per-state and conditional forms to enable dynamic reward-cost trade-offs and Pareto-front exploration. The approach is instantiated with MPO and Retrace in off-policy settings and validated across DM Control benchmarks, a Minitaur locomotion task, and a real Sawyer reaching task, demonstrating smoother, energy-efficient behavior without sacrificing performance. This results in a practical, scalable method for robust, real-world continuous control that can adapt to multiple tasks and constraints without manual hyperparameter tuning of reward-cost weights.

Abstract

The naive application of Reinforcement Learning algorithms to continuous control problems -- such as locomotion and manipulation -- often results in policies which rely on high-amplitude, high-frequency control signals, known colloquially as bang-bang control. Although such solutions may indeed maximize task reward, they can be unsuitable for real-world systems. Bang-bang control may lead to increased wear and tear or energy consumption, and tends to excite undesired second-order dynamics. To counteract this issue, multi-objective optimization can be used to simultaneously optimize both the reward and some auxiliary cost that discourages undesired (e.g. high-amplitude) control. In principle, such an approach can yield the sought-after smooth control policies. It can, however, be hard to find the correct trade-off between cost and return that results in the desired behavior. In this paper we propose a new constraint-based reinforcement learning approach that ensures task success while minimizing one or more auxiliary costs (such as control effort). We employ Lagrangian relaxation to learn both (a) the parameters of a control policy that satisfies the desired constraints and (b) the Lagrangian multipliers for the optimization. Moreover, we demonstrate that we can satisfy constraints either in expectation or in a per-step fashion, and can even learn a single policy that is able to dynamically trade off between return and cost. We demonstrate the efficacy of our approach using a number of continuous control benchmark tasks, a realistic, energy-optimized quadruped locomotion task, as well as a reaching task on a real robot arm.
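
The Lagrangian relaxation described in the abstract can be illustrated with a minimal sketch on a toy problem. This is not the paper's MPO and Retrace implementation: the scalar "policy" `a`, the functions `reward` and `cost`, the bound `R_MIN`, and the step sizes are all hypothetical choices made for illustration. The sketch minimizes a control cost subject to a lower bound on reward by alternating gradient ascent on the policy parameter with projected gradient descent on a single non-negative multiplier.

```python
import math

R_MIN = 0.8  # hypothetical lower bound on task reward (assumption, not from the paper)


def reward(a: float) -> float:
    """Toy task reward: saturates as control amplitude a grows."""
    return 1.0 - math.exp(-3.0 * a)


def cost(a: float) -> float:
    """Toy auxiliary cost: squared control amplitude."""
    return a * a


def primal_dual(steps: int = 10_000, lr: float = 0.05) -> tuple[float, float]:
    """Solve max_a min_{lam >= 0} L(a, lam) by simultaneous gradient steps,
    where L(a, lam) = -cost(a) + lam * (reward(a) - R_MIN)."""
    a, lam = 0.0, 0.0
    for _ in range(steps):
        grad_a = -2.0 * a + lam * 3.0 * math.exp(-3.0 * a)  # dL/da
        grad_lam = reward(a) - R_MIN                        # dL/dlam
        a += lr * grad_a                     # ascent on the policy parameter
        lam = max(0.0, lam - lr * grad_lam)  # descent on the multiplier, kept >= 0
    return a, lam


a_star, lam_star = primal_dual()
```

In this toy setting the multiplier grows while the reward constraint is violated and settles once `reward(a_star)` reaches `R_MIN`, so the trade-off weight between reward and cost is found by the optimization rather than tuned by hand. The state-dependent multipliers and per-step constraints mentioned in the TL;DR would replace the scalar `lam` with a learned function of the state, which this sketch does not attempt.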

Paper Structure

This paper contains 15 sections, 14 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The continuous control environments used in the experiments. Cart-pole swingup (\ref{fig:cartpole_sim}) and humanoid stand and walk (\ref{fig:humanoid_sim}) are from the DM control suite Tassa2018ControlSuite. The Minitaur robot (\ref{fig:minitaur_sim}) is similarly simulated in MuJoCo. Finally, we use a real Sawyer robot (\ref{fig:sawyer}) for a reaching task.
  • Figure 2: Representative results of the executed policies in the control benchmark tasks. Plot (\ref{fig:cartpole_comparison}) shows a representative rollout of the (1-dimensional) policy trained on cart-pole swingup in the unconstrained, constrained and original reward setting. In all three cases, we observe high control input during the first 2 seconds, corresponding to the swingup phase. Figure (\ref{fig:humanoid_stand_comparison}) shows the control norm during the episode rollout of policies trained in humanoid stand. Note that in both tasks the actual return between the three methods is almost identical.
  • Figure 3: Comparison of policies trained on the humanoid stand task in the constrained, unconstrained and original reward setup. Figures show the average frame of the final 50% of the episode. Policies that exhibit more bang-bang-style control will result in more jittering movements and hence more blurry images.
  • Figure 4: Comparison of a single versus a state-dependent $\lambda$ multiplier for models trained to achieve a minimum velocity of 0.5 m/s. A single multiplier results in large swings in reward and on average higher values of $\lambda$. (\ref{fig:minitaur_reward_time}) shows the per-step reward over time. (\ref{fig:minitaur_reward_penalty}) shows the trade-off between the per-step reward and penalty during training. Policies start off at 0 m/s and first learn to satisfy the constraint before optimizing the penalty. (\ref{fig:minitaur_lambda_time}) shows how the Lagrangian multiplier(s) change over time. For the state-dependent case, we show the mean and standard deviation of $\lambda$ across the training batch.
  • Figure 5: Comparison of the constrained optimization approach with baselines using a fixed penalty. Each data point shows the average absolute velocity delta and penalty for an agent optimized for a specific target velocity. The different ellipse shades show one to three standard deviations, both for the fixed (red) and the varying (blue) velocity setpoints. For each setting we train four agents. In the fixed target case, these are different models. In the conditional target case, these are evaluations of a single model conditioned on desired velocities.
  • ...and 2 more figures