MM-LMPC: Multi-Modal Learning Model Predictive Control via Bandit-Based Mode Selection
Wataru Hashimoto, Kazumune Hashimoto
TL;DR
The paper tackles a limitation of Learning Model Predictive Control (LMPC): exploration is biased toward early, low-cost trajectories, so globally superior solution modes can be missed. It proposes MM-LMPC, which clusters past trajectories into multiple modes, assigns mode-specific terminal sets and value functions, and uses a bandit-based meta-controller with a Lower Confidence Bound policy to select which mode to refine at each iteration. The authors prove recursive feasibility, stability, and asymptotic convergence to the best mode, along with a logarithmic regret bound of $O(\log T)$ for mode exploration. A Dubins car reach-avoid simulation shows that MM-LMPC systematically explores multiple routes and achieves lower final costs than standard LMPC, indicating improved global performance on this iterative task.
Abstract
Learning Model Predictive Control (LMPC) improves performance on iterative tasks by leveraging data from previous executions. At each iteration, LMPC constructs a sampled safe set from past trajectories and uses it as a terminal constraint, with a terminal cost given by the corresponding cost-to-go. While effective, LMPC depends heavily on its initial trajectories: states with high cost-to-go are rarely selected as terminal candidates in later iterations, leaving parts of the state space unexplored and potentially missing better solutions. For example, in a reach-avoid task with two possible routes, LMPC may keep refining the initially shorter path while neglecting the alternative path that could lead to a globally better solution. To overcome this limitation, we propose Multi-Modal LMPC (MM-LMPC), which clusters past trajectories into modes and maintains mode-specific terminal sets and value functions. A bandit-based meta-controller with a Lower Confidence Bound (LCB) policy balances exploration and exploitation across modes, enabling systematic refinement of all modes. This allows MM-LMPC to escape high-cost local optima and discover globally superior solutions. We establish recursive feasibility, closed-loop stability, asymptotic convergence to the best mode, and a logarithmic regret bound. Simulations on obstacle-avoidance tasks validate the performance improvements of the proposed method.
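
The LCB-based mode selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact confidence index, so a standard UCB1-style bonus mirrored for cost minimization is assumed here. The names `select_mode`, `update`, and `run`, the exploration weight `c`, and the noisy constant costs that stand in for the closed-loop cost of one LMPC iteration within a mode are all hypothetical. In MM-LMPC itself, the per-mode cost would come from running LMPC with that mode's terminal set and value function, and would not be a stationary noisy constant.

```python
import math
import random


def select_mode(mean_costs, counts, t, c=1.0):
    """Return the index of the mode with the smallest lower confidence bound.

    Assumed index: mean_cost - c * sqrt(2 * ln(t) / n), i.e. UCB1 mirrored
    for cost minimization. The paper's exact index may differ.
    """
    # Try every mode once before relying on any cost estimate.
    for k, n in enumerate(counts):
        if n == 0:
            return k
    lcb = [
        m - c * math.sqrt(2.0 * math.log(t) / n)
        for m, n in zip(mean_costs, counts)
    ]
    return min(range(len(lcb)), key=lcb.__getitem__)


def update(mean_costs, counts, k, cost):
    """Fold one observed iteration cost into the running mean of mode k."""
    counts[k] += 1
    mean_costs[k] += (cost - mean_costs[k]) / counts[k]


def run(true_costs, iterations=200, noise=0.5, seed=0):
    """Simulate mode selection with hypothetical noisy per-mode costs."""
    rng = random.Random(seed)
    num_modes = len(true_costs)
    mean_costs = [0.0] * num_modes
    counts = [0] * num_modes
    for t in range(1, iterations + 1):
        k = select_mode(mean_costs, counts, t)
        # Placeholder for the closed-loop cost of one LMPC iteration in mode k.
        cost = true_costs[k] + rng.gauss(0.0, noise)
        update(mean_costs, counts, k, cost)
    return counts, mean_costs


if __name__ == "__main__":
    counts, means = run([10.0, 8.0])
    print("selections per mode:", counts)
    print("estimated mean costs:", means)
```

Under these assumptions, a rarely selected mode keeps a large confidence bonus and is therefore revisited from time to time, while the mode with the lower estimated cost receives most of the iterations. This is the exploration-exploitation balance that the abstract attributes to the meta-controller.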
