Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction

Ziyang Lin, Zixuan Sun, Sanhorn Chen, Xiaoyang Chen, Roy Zhao

TL;DR

Real-time sequential control is bottlenecked by inference latency, especially for Transformer-based agents. The authors introduce a speculation-and-correction framework that uses a pretrained world model (TD-MPC2) to draft short-horizon plans and latent rollouts, together with lightweight, mismatch-aware correctors that recycle speculative actions rather than restarting planning. On DMC Humanoid-Walk, the method cuts planning inferences by 43.6% (from 500 to 282) and improves end-to-end step latency by 25%, at the cost of a 7.1% drop in return, and the ablations show that correction is essential for robustness. The work highlights a practical path to accelerating Transformer-based control under tight latency budgets by combining input prediction, speculative execution, and residual correction, with potential extensions to longer horizons and adaptive budgeting.

Abstract

Real-time sequential control agents are often bottlenecked by inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculation-and-correction framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two-tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid-Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch-aware correction for robust latency reduction.
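The per-step decision described in the abstract (execute a queued action, correct it when the latent mismatch is small to moderate, otherwise clear the queue and replan) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the L2 mismatch metric, the single threshold `fallback_tol`, the class name `SpeculativeController`, and the `plan` and `correct` callables are all assumptions introduced here, and the toy planner and corrector stand in for TD-MPC2 and the learned corrector.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

Vector = List[float]
# (speculative action, latent predicted for the step at which it will run)
Step = Tuple[Vector, Vector]


def latent_mismatch(z_real: Vector, z_pred: Vector) -> float:
    """L2 distance between encoded and predicted latents (assumed metric)."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(z_real, z_pred)))


@dataclass
class SpeculativeController:
    plan: Callable[[Vector], List[Step]]                 # stand-in for the full planner
    correct: Callable[[Vector, Vector, Vector], Vector]  # returns a residual delta_a
    fallback_tol: float                                  # hypothetical mismatch threshold
    queue: Deque[Step] = field(default_factory=deque)
    plan_calls: int = 0

    def act(self, z_real: Vector) -> Vector:
        if self.queue:
            action, z_pred = self.queue.popleft()
            if latent_mismatch(z_real, z_pred) <= self.fallback_tol:
                # Small-to-moderate mismatch: reuse the action with a residual fix.
                delta = self.correct(z_real, z_pred, action)
                return [a + d for a, d in zip(action, delta)]
            # Large mismatch: the remaining queued actions are stale.
            self.queue.clear()
        # Empty queue or fallback: pay for one full planning call.
        self.plan_calls += 1
        self.queue.extend(self.plan(z_real))
        action, _ = self.queue.popleft()
        return action


def toy_plan(z: Vector) -> List[Step]:
    # Toy planner: three actions, predicting the latent advances by 1 per step.
    return [([0.1], z), ([0.2], [z[0] + 1.0]), ([0.3], [z[0] + 2.0])]


def toy_correct(z_real: Vector, z_pred: Vector, action: Vector) -> Vector:
    # Toy corrector: residual proportional to the latent error.
    return [0.5 * (z_real[0] - z_pred[0])]


ctrl = SpeculativeController(plan=toy_plan, correct=toy_correct, fallback_tol=0.5)
a0 = ctrl.act([0.0])  # empty queue, so a full plan is computed
a1 = ctrl.act([1.1])  # mismatch 0.1, so the queued action is corrected
a2 = ctrl.act([5.0])  # mismatch 3.0, so the queue is cleared and replanned
```

In this toy run only two of the three steps call the planner, which is the mechanism behind the reported drop from 500 to 282 planning inferences, a reduction of (500 - 282) / 500 = 43.6%.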

Paper Structure

This paper contains 20 sections, 6 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Speculative decoding system diagram (vLLM) [openlm2024]. It shows the draft/verify pipeline and KV-cache handling that our approach extends with input prediction and adaptive budgeting.
  • Figure 2: Decision Transformer architecture [chen2021decisiontransformer]. Each trajectory is represented as an interleaved token sequence of return, state, and action, which are embedded and passed into a causal Transformer to autoregressively predict future actions.
  • Figure 3: Speculative execution for TD-MPC2. The agent plans a short-horizon action sequence in latent space, executes multiple steps directly, and uses a learned corrector when the real latent deviates from the predicted latent.
  • Figure 4: Gated two-tower corrector. Separate towers encode $z^{\text{real}}$ and $\hat{z}$; a mismatch pathway and gate $g$ modulate the residual action update $\Delta a$ applied to the speculative action.
  • Figure 5: Temporal Transformer corrector. A length-$K$ history of mismatch features is embedded and processed by a Transformer encoder to capture drift over time, producing a residual action correction.
  • ...and 4 more figures
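
The gated two-tower corrector in the Figure 4 caption can be illustrated with a small forward pass. This is a sketch under stated assumptions, not the paper's architecture: the layer sizes, the bias-free linear layers, the tanh activations, the scalar sigmoid gate, and the way the three feature vectors are concatenated are all choices made here for illustration, and the weights are random rather than distilled from a replanning teacher.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _init(rows: int, cols: int, rng: random.Random) -> Matrix:
    # Small random weights; a trained corrector would load learned values.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def _linear(weights: Matrix, x: Vector) -> Vector:
    # Bias-free matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def _tanh(x: Vector) -> Vector:
    return [math.tanh(v) for v in x]


class GatedTwoTowerCorrector:
    """Hypothetical corrector: two towers, a mismatch pathway, and a gate."""

    def __init__(self, latent_dim: int, action_dim: int,
                 hidden_dim: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.tower_real = _init(hidden_dim, latent_dim, rng)  # encodes z_real
        self.tower_pred = _init(hidden_dim, latent_dim, rng)  # encodes z_hat
        self.mismatch = _init(hidden_dim, latent_dim, rng)    # encodes z_real - z_hat
        self.gate = _init(1, hidden_dim, rng)                 # scalar gate g
        self.head = _init(action_dim, 3 * hidden_dim, rng)    # residual head

    def forward(self, z_real: Vector, z_pred: Vector,
                a_spec: Vector) -> Tuple[Vector, float, Vector]:
        h_real = _tanh(_linear(self.tower_real, z_real))
        h_pred = _tanh(_linear(self.tower_pred, z_pred))
        diff = [r - p for r, p in zip(z_real, z_pred)]
        h_mis = _tanh(_linear(self.mismatch, diff))
        # Gate in (0, 1), driven by the mismatch features only.
        g = 1.0 / (1.0 + math.exp(-_linear(self.gate, h_mis)[0]))
        raw = _tanh(_linear(self.head, h_real + h_pred + h_mis))
        delta_a = [g * d for d in raw]  # gated residual action update
        corrected = [a + d for a, d in zip(a_spec, delta_a)]
        return corrected, g, delta_a


corrector = GatedTwoTowerCorrector(latent_dim=4, action_dim=2)
z_pred = [0.1, -0.2, 0.3, 0.0]
z_real = [0.15, -0.1, 0.25, 0.05]
a_spec = [0.4, -0.6]
corrected, g, delta_a = corrector.forward(z_real, z_pred, a_spec)
```

Because the raw residual passes through tanh and is then scaled by g, each component of the update is bounded by the gate value, so a gate near zero leaves the speculative action almost unchanged.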