Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris

Haochuan Kevin Wang

TL;DR

Feasibility masking is necessary in discrete domains: it removes invalid action mass and yields a 6.8% improvement in score and a 5.6% improvement in survival over unconstrained sampling.

Abstract

We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
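As an illustration of the feasibility-constrained sampling described in the abstract, the following is a minimal sketch of logit masking over a single placement step. It is not the paper's implementation: the function names (`masked_softmax`, `invalid_mass`, `sample_action`), the flat list of per-placement logits, and the boolean validity list are all assumptions made for this example.

```python
import math
import random


def masked_softmax(logits, valid):
    """Softmax over logits, with infeasible entries forced to probability 0.

    `logits` and `valid` are hypothetical inputs: one score and one
    feasibility flag per candidate placement.
    """
    if not any(valid):
        raise ValueError("at least one action must be feasible")
    # A logit of -inf removes that action's probability mass entirely.
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


def invalid_mass(logits, valid):
    """Probability mass an unmasked softmax places on infeasible actions."""
    probs = masked_softmax(logits, [True] * len(logits))
    return sum(p for p, ok in zip(probs, valid) if not ok)


def sample_action(logits, valid, rng):
    """Draw one action index from the feasibility-masked distribution."""
    probs = masked_softmax(logits, valid)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under these assumptions, `invalid_mass` gives a per-step analogue of the invalid action mass that the abstract reports being removed, and `sample_action` can only return indices whose validity flag is set.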

Paper Structure (37 sections, 1 equation, 4 figures, 1 table)


Figures (4)

  • Figure 1: Pseudocode for the diffusion-MPC planning step.
  • Figure 2: Effect of feasibility masking and reranking strategy on mean episode score and percentage of episodes achieving score $> 0$. Masking yields a $6.8\times$ improvement. DQN reranking erases the masking benefit.
  • Figure 3: Compute-aware frontier across configurations. Left: mean score vs. decision latency. Right: survival rate (% episodes with score $>0$) vs. latency. Dashed lines show Pareto frontiers. The masked heuristic $H=4$ configuration is both faster and stronger than masked $H=8$, while DQN reranking is off-frontier at both horizons.
  • Figure 4: Compute scaling: mean episode score vs. number of candidates $K$, with feasibility masking and heuristic reranking ($H=8$). Performance increases strongly with $K$, while latency grows approximately linearly.