Residual-MPPI: Online Policy Customization for Continuous Control

Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan

TL;DR

Residual-MPPI adapts pretrained continuous-control policies at execution time without retraining. By adding a log-likelihood term of the prior policy to MPPI’s online planning objective, it customizes the policy for new objectives in zero-shot or few-shot settings, using only the prior action distribution and an add-on reward. The approach is validated on MuJoCo benchmarks and on a high-fidelity autonomous racing task with GT Sophy in Gran Turismo Sport (GTS), where it aligns behavior with the add-on objectives while preserving basic task performance and refines the dynamics model data-efficiently. This enables flexible, data-efficient, real-time policy customization in continuous-control settings with limited access to the original training data or rewards.

Abstract

Policies developed through Reinforcement Learning (RL) and Imitation Learning (IL) have shown great potential in continuous control tasks, but real-world applications often require adapting trained policies to unforeseen requirements. While fine-tuning can address such needs, it typically requires additional data and access to the original training metrics and parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the need for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at execution time, which we call Residual-MPPI. It can customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings, given access to the prior action distribution alone. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging Gran Turismo Sport (GTS) car racing environment. Code for MuJoCo experiments is included in the supplementary material and will be open-sourced upon acceptance. Demo videos and code are available on our website: https://sites.google.com/view/residual-mppi.
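To make the planning idea concrete, here is a minimal NumPy sketch of one planning step in the spirit of the description above: action sequences are sampled around the prior policy, rolled out through a dynamics model, scored by the add-on reward plus the prior log-likelihood, and combined with exponential weights. It is an illustration under stated assumptions, not the authors' implementation: the prior is assumed Gaussian with a fixed standard deviation, the names `residual_mppi_step`, `omega`, `lam`, and the `toy_*` functions are hypothetical, and no importance-sampling correction is applied.

```python
import numpy as np


def residual_mppi_step(x0, prior_mean, prior_std, dynamics, addon_reward,
                       horizon=5, num_samples=1024, omega=1.0, lam=1.0,
                       rng=None):
    """Return one customized action for state x0 (illustrative sketch).

    prior_mean(x)      -> (K, action_dim) mean action of the assumed Gaussian prior
    dynamics(x, u)     -> (K, state_dim) next states (deterministic model)
    addon_reward(x, u) -> (K,) add-on reward per sample
    omega, lam         -> hypothetical reward weight and softmax temperature
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.repeat(np.asarray(x0, dtype=float)[None, :], num_samples, axis=0)
    scores = np.zeros(num_samples)
    first_actions = None
    for t in range(horizon):
        mu = prior_mean(x)
        # Sample actions around the prior policy's mean action.
        u = mu + prior_std * rng.standard_normal(mu.shape)
        # Gaussian log-likelihood of each sampled action under the prior.
        log_prob = -0.5 * np.sum(
            ((u - mu) / prior_std) ** 2 + np.log(2.0 * np.pi * prior_std ** 2),
            axis=-1,
        )
        # Score = weighted add-on reward + prior log-likelihood.
        scores += omega * addon_reward(x, u) + log_prob
        if t == 0:
            first_actions = u
        x = dynamics(x, u)
    # Exponential (softmax) weights over sampled sequences, shifted for stability.
    weights = np.exp((scores - scores.max()) / lam)
    weights /= weights.sum()
    return weights @ first_actions


# Toy 1-D problem (all hypothetical): the prior drives the state toward 1.0,
# and the add-on reward penalizes large action magnitudes.
def toy_prior_mean(x):
    return 1.0 - x


def toy_dynamics(x, u):
    return x + 0.1 * u


def toy_addon_reward(x, u):
    return -np.abs(u[:, 0])


if __name__ == "__main__":
    action = residual_mppi_step(np.zeros(1), toy_prior_mean, 0.3, toy_dynamics,
                                toy_addon_reward, omega=5.0,
                                rng=np.random.default_rng(0))
    print(action)
```

In this toy setting, setting `omega=0` leaves the weighted action close to the prior's mean action, while a positive `omega` pulls it toward smaller magnitudes; the prior log-likelihood term is what keeps the customized action from drifting arbitrarily far from the prior.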

Paper Structure

This paper contains 37 sections, 2 theorems, 26 equations, 17 figures, 14 tables, 2 algorithms.

Key Result

Proposition 1

Given an MDP defined by $\mathcal{M} = (\mathcal{X}, \mathcal{U}, r, p)$, with a deterministic state transition $p$ defined with respect to a dynamics model $F$ and a discount factor $\gamma = 1$, the distribution of the action sequence $q^*(U)$ at state $\boldsymbol{x}_0$ in horizon $T$, where each ... where $V^*$ is the soft value function (Soft-Q) and $\boldsymbol{x}_t$ is defined recursively from ...
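The statement is cut off in this summary, so the following is only a hedged sketch of why a closed-form sequence distribution is plausible under these conditions, not the paper's own equation. It assumes the standard maximum-entropy RL form of the optimal policy with an entropy weight $\alpha$ (a symbol introduced here, not shown in the text above), a soft Q-function $Q^*$, and the rollout $\boldsymbol{x}_{t+1} = F(\boldsymbol{x}_t, \boldsymbol{u}_t)$. With deterministic transitions and $\gamma = 1$, the per-step value terms telescope:

```latex
\begin{aligned}
\pi^*(\boldsymbol{u}_t \mid \boldsymbol{x}_t)
  &= \exp\!\Big(\tfrac{1}{\alpha}\big(Q^*(\boldsymbol{x}_t, \boldsymbol{u}_t) - V^*(\boldsymbol{x}_t)\big)\Big), \\
Q^*(\boldsymbol{x}_t, \boldsymbol{u}_t)
  &= r(\boldsymbol{x}_t, \boldsymbol{u}_t) + V^*(\boldsymbol{x}_{t+1}),
  \qquad \boldsymbol{x}_{t+1} = F(\boldsymbol{x}_t, \boldsymbol{u}_t), \\
\prod_{t=0}^{T-1} \pi^*(\boldsymbol{u}_t \mid \boldsymbol{x}_t)
  &= \exp\!\Big(\tfrac{1}{\alpha}\sum_{t=0}^{T-1}
     \big(r(\boldsymbol{x}_t, \boldsymbol{u}_t) + V^*(\boldsymbol{x}_{t+1}) - V^*(\boldsymbol{x}_t)\big)\Big) \\
  &= \exp\!\Big(-\tfrac{1}{\alpha} V^*(\boldsymbol{x}_0)\Big)\,
     \exp\!\Big(\tfrac{1}{\alpha}\Big(\sum_{t=0}^{T-1} r(\boldsymbol{x}_t, \boldsymbol{u}_t) + V^*(\boldsymbol{x}_T)\Big)\Big).
\end{aligned}
```

Under these assumptions, the first factor depends only on $\boldsymbol{x}_0$ and acts as a normalizer, so the sequence distribution is proportional to the exponentiated sum of rewards plus the terminal soft value.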

Figures (17)

  • Figure 1: Overview of the proposed algorithm. In each planning loop, we utilize the prior policy to generate samples and then evaluate them through both the log-likelihood of the prior policy and an add-on reward to obtain the customized actions. More details are in Sec. \ref{sec: remppi}. In the experiments, we demonstrate that Residual-MPPI can accomplish the online policy customization task effectively, even in a challenging GTS environment with the champion-level racing agent, GT Sophy 1.0.
  • Figure 2: In-game screenshots of policy behavior on different road sections.
  • Figure 3: Guided-MPPI and Greedy-MPPI results in GTS. (a) In-game screenshots of Greedy-MPPI; (b) In-game screenshots of Guided-MPPI. Red parts indicate off-course behaviors. Neither baseline can drive the vehicle effectively; both go completely off track at the first corner.
  • Figure 4: (a) The angle of the Half Cheetah's hind leg vs. the environmental steps. (b) The angle of the Swimmer's first rotor vs. the environmental steps. (c) The trajectory of the Hopper robot on the $x$ and $z$ axes. (d) The trajectory of the Ant robot on the $x$ and $y$ axes.
  • Figure 5: Typical complete trajectories of all policies, where the red parts indicate off-course behaviors. (a) The trajectory of GT Sophy 1.0. It finishes the lap in $117.762$s with $93$ steps off course. (b) The trajectory of Residual-SAC. It finishes the lap in $131.078$s with $2$ steps off course. (c) The trajectory of Zero-shot MPPI. It finishes the lap in $123.551$s with $10$ steps off course. (d) The trajectory of Few-shot MPPI. It finishes the lap in $122.919$s with $4$ steps off course.
  • ...and 12 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Lemma 1
  • proof
  • proof