TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi
TL;DR
This work identifies persistent value overestimation in TD-MPC2 as arising from policy mismatch between planner-driven data and the learned value/policy prior, causing extrapolation errors that accumulate under function approximation. It introduces TD-M(PC)$^2$, a minimalist policy-constraint approach that constrains policy updates to stay in-distribution via a simple TD3-BC–style objective, implemented as a small modification to TD-MPC2 with no extra compute. The authors provide theoretical analysis linking policy mismatch, extrapolation error, and $H$-step lookahead suboptimality, showing how their constraint mitigates error accumulation. Empirically, TD-M(PC)$^2$ yields substantial improvements over TD-MPC2, particularly on 61-DoF humanoid tasks in HumanoidBench and DMControl benchmarks, validated by ablations that demonstrate robustness to regularization strength and the central role of conservatism in high-dimensional control.
Abstract
Model-based reinforcement learning algorithms that combine model-based planning with a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often suffer from \emph{persistent value overestimation}. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data-generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate this mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method requires only minimal changes to existing frameworks and no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly on 61-DoF humanoid tasks. Qualitative results are available at https://darthutopian.github.io/tdmpc_square/.
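To make the idea of a policy regularization term concrete, the following is a minimal sketch of a generic TD3-BC-style actor loss, the family of objectives the TL;DR refers to. It is not the authors' exact objective: the function name `td3_bc_policy_loss`, the default `alpha=2.5` (the value used in the original TD3-BC work), and the choice of NumPy are assumptions made for illustration. The loss trades off maximizing the critic's value at the policy's action against staying close to the action stored in the buffer, which in this setting would be the planner's action.

```python
import numpy as np


def td3_bc_policy_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """Generic TD3-BC-style actor loss (to be minimized).

    Hypothetical helper for illustration only, not the paper's implementation.

    q_values:       critic estimates Q(s, pi(s)), shape (N,)
    policy_actions: actions proposed by the policy prior pi(s), shape (N, A)
    data_actions:   actions stored in the replay buffer, shape (N, A)
    alpha:          trade-off weight; larger values favor value maximization
    """
    q_values = np.asarray(q_values, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    data_actions = np.asarray(data_actions, dtype=float)

    # Normalize the value term by the mean absolute Q so that alpha is
    # roughly scale-invariant across tasks (small epsilon avoids div by zero).
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)

    # Behavior-cloning penalty: mean squared distance to the buffer actions.
    bc_penalty = np.mean((policy_actions - data_actions) ** 2)

    # Minimizing this maximizes lam * Q while penalizing deviation from data.
    return -lam * np.mean(q_values) + bc_penalty


if __name__ == "__main__":
    q = np.array([1.0, 1.0])
    buffer_actions = np.zeros((2, 3))
    print(td3_bc_policy_loss(q, buffer_actions, buffer_actions))
    print(td3_bc_policy_loss(q, buffer_actions + 1.0, buffer_actions))
```

With equal value estimates, the second call returns a larger loss than the first because the policy's actions deviate from the buffer actions; this penalty is what discourages the policy prior from drifting toward out-of-distribution actions that the critic has never been trained on.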
