Goal-Conditioned Terminal Value Estimation for Real-time and Multi-task Model Predictive Control

Mitsuki Morita; Satoshi Yamamori; Satoshi Yagi; Norikazu Sugimoto; Jun Morimoto

TL;DR

This paper addresses the computational cost of real-time Model Predictive Control (MPC) for high-dimensional robots by introducing goal-conditioned terminal value learning. It presents a two-layer hierarchical framework: the lower-layer MPC uses terminal values $\hat{V}_{\theta}(\boldsymbol{x}, \boldsymbol{g})$ conditioned on goals, while the upper layer generates time-varying goal trajectories to steer behavior, which enables multitask control. The approach is augmented with domain randomization to improve robustness and with a surrogate robot model to speed up computation. It is demonstrated on a simulated bipedal inverted pendulum (Diablo) tracking a lemniscate path on flat and sloped terrain, where it runs in real time within a $10$ ms control cycle and performs comparably to longer-horizon MPC. The results show that terminal values learned across varied goals enable diverse, smooth motions and robust operation under terrain changes, highlighting the practical potential of real-time, multitask MPC in dynamic environments.
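
As a reading aid, the lower-layer problem summarized above can be written in the generic form of a short-horizon MPC with a learned terminal value. This is a sketch under assumptions, not the paper's own equation: the horizon length $H$, the dependence of the stage cost $c$ on the goal, and the indexing of the goal sequence $\boldsymbol{g}_k$ are illustrative choices, while $\boldsymbol{x}$, $\boldsymbol{u}$, $\boldsymbol{f}$, and $\hat{V}_{\theta}$ follow the notation used in the figure captions.

$$
\boldsymbol{u}^{*}_{t:t+H-1} = \arg\min_{\boldsymbol{u}_{t:t+H-1}} \; \sum_{k=t}^{t+H-1} c(\boldsymbol{x}_k, \boldsymbol{u}_k, \boldsymbol{g}_k) + \hat{V}_{\theta}(\boldsymbol{x}_{t+H}, \boldsymbol{g}_{t+H})
\quad \text{s.t.} \quad \boldsymbol{x}_{k+1} = \boldsymbol{f}(\boldsymbol{x}_k, \boldsymbol{u}_k)
$$

Under this reading, the terminal term stands in for the cost that a longer horizon would otherwise accumulate beyond step $t+H$, which is why a short horizon can fit inside the control period. Conditioning it on $\boldsymbol{g}$ lets one learned value serve many goals.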

Abstract

While MPC enables nonlinear feedback control by solving an optimal control problem at each timestep, the computational burden is often so large that the optimization cannot be completed within the control period. One possible approach to this issue is to use terminal value learning to reduce the computational cost. However, in the original MPC setup, a value learned for one task cannot be reused for other tasks when the task changes dynamically. In this study, we develop an MPC framework with goal-conditioned terminal value learning to achieve multitask policy optimization while reducing computational time. Furthermore, by using a hierarchical control structure that allows the upper-level trajectory planner to output appropriate goal-conditioned trajectories, we demonstrate that a robot model is able to generate diverse motions. We evaluate the proposed method on a bipedal inverted pendulum robot model and confirm that combining goal-conditioned terminal value learning with an upper-level trajectory planner enables real-time control; thus, the robot successfully tracks a target trajectory on sloped terrain.
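
The two-layer idea in the abstract can be illustrated with a deliberately tiny example. The sketch below is not the authors' implementation. It assumes a 1-D single-integrator system, a three-valued control set searched exhaustively, and a hand-written quadratic `terminal_value` that stands in for the learned network $\hat{V}_{\theta}$. All function names and constants are hypothetical.

```python
import itertools

DT = 0.1                      # assumed step size for this toy system
CONTROLS = (-1.0, 0.0, 1.0)   # assumed discrete control set


def dynamics(x, u):
    """Toy single-integrator model (stand-in for the robot dynamics f)."""
    return x + DT * u


def stage_cost(x, u, g):
    """Quadratic tracking cost plus a small control penalty (assumed form)."""
    return (x - g) ** 2 + 0.01 * u ** 2


def terminal_value(x, g):
    """Hand-written stand-in for a learned goal-conditioned value V(x, g)."""
    return 5.0 * (x - g) ** 2


def mpc_step(x, goals):
    """Lower layer: exhaustive short-horizon search, returns the first control.

    `goals` holds one goal per predicted step plus one for the terminal state.
    """
    horizon = len(goals) - 1
    best_cost, best_u0 = float("inf"), 0.0
    for seq in itertools.product(CONTROLS, repeat=horizon):
        xk, total = x, 0.0
        for k, u in enumerate(seq):
            total += stage_cost(xk, u, goals[k])
            xk = dynamics(xk, u)
        total += terminal_value(xk, goals[horizon])
        if total < best_cost:
            best_cost, best_u0 = total, seq[0]
    return best_u0


def run_closed_loop(x0, goal_fn, steps, horizon=3):
    """Upper layer supplies a goal window each step; lower layer tracks it."""
    x = x0
    for t in range(steps):
        goals = [goal_fn(t + k) for k in range(horizon + 1)]
        x = dynamics(x, mpc_step(x, goals))
    return x
```

For example, `run_closed_loop(0.0, lambda t: 1.0, 30)` drives the state from 0 toward the constant goal 1. Swapping `goal_fn` for a time-varying function changes the behavior without changing `terminal_value`, which is the property the goal conditioning is meant to provide.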

Paper Structure (25 sections, 17 equations, 14 figures, 5 tables, 2 algorithms)

Introduction
Related works
Real-time Robot Control Using Model Predictive Control (MPC)
Goal-Conditioned Reinforcement Learning for Robot Control
Preliminaries
Reinforcement Learning
Value Function Approximation
Model Predictive Control (MPC)
Method
Lower Layer
Upper Layer
Domain Randomization for Robust Value Function Estimation
Faster MPC Calculation with a Surrogate Robot Model
Experiments
The Bipedal Inverted Pendulum Robot
...and 10 more sections

Figures (14)

Figure 1: Schematic diagram of the proposed method. In the training phase, target goals $\boldsymbol{g}_\mathrm{target}$ are sampled from a predefined distribution, and a goal-conditioned value function $\hat{V}_{\theta}(\boldsymbol{x}_t, \boldsymbol{g}_{\mathrm{target}})$ is learned using MPC solutions. In the inference phase, the goal trajectory $\boldsymbol{g}^{\mathrm{plan}}_{t+H}$ is generated through planning for the MPC horizon. By inputting this trajectory into the MPC, the system can adapt to environmental changes and smoothly adjust its behavior. Here, $\boldsymbol{x}_t$, $\boldsymbol{u}_t$, $\boldsymbol{c}_t$, and $\boldsymbol{f}$ represent the state, control input, cost function, and system dynamics, respectively.
Figure 2: Concrete implementation for controlling the bipedal inverted pendulum system. Upper layer: During training, it outputs random goal variables. During inference, it generates a sequence of goal variables, $\boldsymbol{g}_{0 }$, of the same length as the prediction horizon, tailored to the robot's state and reference trajectory (e.g., desired robot position and velocity). Lower layer: During training, it learns terminal values corresponding to the goal variables. During inference, it uses the learned terminal values to optimize the predicted trajectory and generate actions in accordance with the commands received from the upper layer. $\boldsymbol{x}_t$, $\boldsymbol{u}_t$, $v$, $\omega$, and $\alpha$ represent the state, control input, robot's linear velocity, angular velocity, and slope angle, respectively.
Figure 3: Simulated robot model. $q_1$ and $q_2$ represent joint angles, and $\omega_{\mathrm{wheel}}$ denotes the angular velocities of the wheels.
Figure 4: Terrain and lemniscate trajectory for the tracking task.
Figure 5: The robot learns to traverse uneven terrain, making the driving module more robust.
...and 9 more figures
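
Figure 4 refers to a lemniscate reference trajectory, and the Figure 2 caption describes an upper layer that emits one goal per step of the prediction horizon. The sketch below shows one way such a goal window could be produced. It is illustrative only: the lemniscate of Bernoulli parametrization, the size `a`, the lap `period`, and the goal layout `(x, y, speed)` are assumptions, and only the 10 ms step matches the control cycle stated in the TL;DR.

```python
import math


def lemniscate_point(s, a=1.0):
    """Point on a lemniscate of Bernoulli at curve parameter s (assumed shape)."""
    denom = 1.0 + math.sin(s) ** 2
    return (a * math.cos(s) / denom, a * math.sin(s) * math.cos(s) / denom)


def goal_window(t, horizon, dt=0.01, period=20.0, a=1.0):
    """Return horizon + 1 goals (x, y, speed) starting at time t.

    Speed is a finite-difference estimate along the reference path.
    `period` (seconds per lap) and `a` (size) are hypothetical parameters.
    """
    goals = []
    for k in range(horizon + 1):
        s_now = 2.0 * math.pi * (t + k * dt) / period
        s_next = 2.0 * math.pi * (t + (k + 1) * dt) / period
        x0, y0 = lemniscate_point(s_now, a)
        x1, y1 = lemniscate_point(s_next, a)
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        goals.append((x0, y0, speed))
    return goals
```

Calling `goal_window(0.0, 5)` yields six goals, the first at the rightmost point of the curve. A real upper layer would presumably also account for the robot's current state and the slope angle, which this sketch omits.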
