TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint

Haotian Lin, Pengcheng Wang, Jeff Schneider, Guanya Shi

TL;DR

This work identifies persistent value overestimation in TD-MPC2 as arising from policy mismatch between planner-driven data and the learned value/policy prior, causing extrapolation errors that accumulate under function approximation. It introduces TD-M(PC)$^2$, a minimalist policy-constraint approach that constrains policy updates to stay in-distribution via a simple TD3-BC–style objective, implemented as a small modification to TD-MPC2 with no extra compute. The authors provide theoretical analysis linking policy mismatch, extrapolation error, and $H$-step lookahead suboptimality, showing how their constraint mitigates error accumulation. Empirically, TD-M(PC)$^2$ yields substantial improvements over TD-MPC2, particularly on 61-DoF humanoid tasks in HumanoidBench and DMControl benchmarks, validated by ablations that demonstrate robustness to regularization strength and the central role of conservatism in high-dimensional control.

Abstract

Model-based reinforcement learning algorithms that combine model-based planning with a learned value/policy prior have gained significant recognition for their high data efficiency and superior performance in continuous control. However, we discover that existing methods that rely on standard SAC-style policy iteration for value learning, directly using data generated by the planner, often result in persistent value overestimation. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate this mismatch in a minimalist way, we propose a policy regularization term that reduces out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks. View qualitative results at https://darthutopian.github.io/tdmpc_square/.
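To make the idea of a policy regularization term concrete, the following is a minimal sketch of a TD3-BC-style policy loss, the family of objectives the TL;DR refers to. It is not the authors' implementation: the function name, the squared-error behavior-cloning term, and the default `alpha=2.5` (the TD3-BC default) are assumptions for illustration, and the paper's exact regularizer may differ.

```python
import numpy as np

def td3_bc_style_policy_loss(q_values, policy_actions, buffer_actions, alpha=2.5):
    """Hypothetical TD3-BC-style loss: maximize Q while staying close to buffer actions.

    q_values:       shape (batch,), critic estimates Q(s, pi(s))
    policy_actions: shape (batch, action_dim), actions proposed by the policy prior
    buffer_actions: shape (batch, action_dim), planner actions stored in the replay buffer
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    policy_actions = np.asarray(policy_actions, dtype=np.float64)
    buffer_actions = np.asarray(buffer_actions, dtype=np.float64)

    # Scale the Q term by the mean |Q| so the trade-off does not depend on reward scale.
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)

    # Behavior-cloning penalty: squared distance to the in-distribution (buffer) action.
    bc_term = np.mean(np.sum((policy_actions - buffer_actions) ** 2, axis=-1))

    # Minimizing this loss maximizes Q while penalizing out-of-distribution actions.
    return -lam * np.mean(q_values) + bc_term
```

In a gradient-based implementation, the scaling factor would be treated as a constant (no gradient flows through it), and the penalty keeps the policy prior close to the actions the planner actually executed, which is what limits OOD queries to the value function.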

Paper Structure

This paper contains 22 sections, 10 theorems, 41 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Assume the nominal policy $\pi_k$ is acquired through approximate policy iteration (API) and the resulting planner policy at the $k$-th iteration is $\pi_{H, k}$, given an upper bound on the value approximation error $\Vert \hat{V}_k - V^{\pi_k} \Vert_\infty \leq \epsilon_k$. Also denote approximation error f for any policy $\mu$, while $C$ is defined as:

Figures (9)

  • Figure 1: Value approximation error for TD-MPC2. The true value is estimated using the average discounted return over 100 episodes following the nominal policy $\pi$; the function estimate is obtained by $\hat{V} = \mathbb{E}_\pi[\hat{Q}]$. The results are averaged over three seeds.
  • Figure 2: Toy Example
  • Figure 3: Humanoid-Bench Locomotion Suite. Average episode return of our method (TD-M(PC)$^2$) and baselines. We report mean performance and 95% CIs across 14 humanoid locomotion tasks. We do not include Reach-v0 in the average result due to its distinct reward scale.
  • Figure 4: DM Control Suite. Average episode return of our method (TD-M(PC)$^2$) and baselines. We report mean performance and 95% CIs across 7 high-dimensional continuous control tasks. We also present the average performance of all algorithms.
  • Figure 5: Value estimation of TD-M(PC)$^2$. The true value and function estimation are obtained with the same approach as in Figure 1. The proposed method significantly mitigates value overestimation for all four tasks.
  • ...and 4 more figures
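
The value-error measurement described in the Figure 1 caption can be sketched as follows: a Monte Carlo estimate of the true value from discounted episode returns, compared against the learned estimate. This is an illustrative sketch only; the function names and the discount factor `gamma=0.99` are assumptions, not values taken from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted return of one episode, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def value_estimation_bias(episode_rewards, value_estimate, gamma=0.99):
    """Learned value estimate minus the Monte Carlo estimate of the true value.

    episode_rewards: list of per-episode reward sequences collected with the nominal policy
    value_estimate:  scalar learned estimate for the initial states (hypothetical input)
    A positive result indicates overestimation.
    """
    true_value = np.mean([discounted_return(r, gamma) for r in episode_rewards])
    return value_estimate - true_value
```

For example, with two identical episodes of rewards `[1, 1, 1]` and `gamma=0.5`, the Monte Carlo value is 1.75, so a learned estimate of 2.0 gives a bias of 0.25.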

Theorems & Definitions (14)

  • Theorem 3.1: $H$-step Policy Suboptimality
  • Theorem 4.1: TD-MPC Error Accumulation
  • Theorem 4.2: Policy divergence
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3
  • Theorem A.4: Policy divergence
  • proof
  • Theorem A.5
  • proof
  • ...and 4 more