ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

Eric Nazarenus; Chuqiao Li; Yannan He; Xianghui Xie; Jan Eric Lenssen; Gerard Pons-Moll

Abstract

We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming and high-quality offline generation within a single model. The core idea is a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, then uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, which allow each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming mode is 5.25× faster while also improving motion quality by 18% in terms of FID over the best previous method.
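
To make the idea of latent-specific diffusion steps concrete, the following is a minimal sketch of how noise levels could be drawn during training: one independent level per motion latent and a single shared level for all frame-level text latents, as the Figure 3 caption describes. It is an illustration under stated assumptions, not the authors' implementation: the uniform sampling of levels, the linear interpolation with t = 1 meaning pure noise, and all function names are hypothetical.

```python
import random


def sample_noise_levels(num_frames, rng):
    """Draw one noise level per motion latent and one shared level for the plan.

    Assumption: levels are sampled uniformly from [0, 1).
    """
    motion_t = [rng.random() for _ in range(num_frames)]  # heterogeneous, per frame
    plan_t = rng.random()  # single global level for all text latents
    return motion_t, plan_t


def interpolate(clean, noise, t):
    """Blend a clean vector (t = 0) with noise (t = 1). Assumed convention."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def noise_sequence(motion, plan, rng):
    """Noise motion latents per frame and plan latents with one shared level."""
    motion_t, plan_t = sample_noise_levels(len(motion), rng)
    noisy_motion = [
        interpolate(frame, [rng.gauss(0.0, 1.0) for _ in frame], t)
        for frame, t in zip(motion, motion_t)
    ]
    noisy_plan = [
        interpolate(frame, [rng.gauss(0.0, 1.0) for _ in frame], plan_t)
        for frame in plan
    ]
    return noisy_motion, noisy_plan, motion_t, plan_t
```

Because every motion latent carries its own noise level during training, a denoiser trained this way can in principle be queried at inference with any mix of clean, partially denoised, and fully noisy frames, which is what makes flexible sampling orders possible.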

Paper Structure (19 sections, 9 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Related Work
ActionPlan: A Framework for Diverse Motion Tasks
Preliminaries
Joint Diffusion on Action Plans and Kinematic Motion
Training with Latent-specific Noise Levels
Flexible Sampling with Action Plan Generation
Experiments
Experimental Setup
Baseline Comparison on Text-to-motion Generation
Ablation Studies
User Study
Applications
Conclusion and Limitations
Action Plan Autoencoder
...and 4 more sections

Figures (6)

Figure 1: ActionPlan decouples high-level action planning from low-level motion generation in a single generative model (a). By conditioning motion synthesis on generated action plans, ActionPlan achieves online generation (b) without the accuracy drop typical of existing streaming methods, and it supports localized edits (c).
Figure 2: Comparison of generation paradigms, where darker shading indicates higher noise levels. By introducing frame-level action plans as semantic conditioning, ActionPlan achieves significantly better FID and R-Precision compared to schedules without ActionPlan. Additionally, our streaming mode completes generation in only $N+T-1$ total steps ($N$: motion tokens, $T$: flow matching steps), enabling efficient low-latency generation without sacrificing motion quality.
Figure 3: Overview of ActionPlan. (a) During training, motion latents are noised with heterogeneous per-frame timesteps, while frame-level text latents share a single global timestep. A Transformer denoiser is trained to jointly reconstruct both. During inference, the model operates in one of two modes: in offline mode (b), the action plan is fully generated first and the motion latents are then denoised in random pyramid order; in streaming mode (c), the action plan is denoised alongside the first motion frame, followed by raster progressive denoising of the remaining latents.
Figure 4: Qualitative comparison with MARDM [meng2025rethinking] and MotionStreamer [xiao2025motionstreamer] on four text prompts. Color shifts from light to dark as time progresses. Incorrectly generated or missing actions are marked with $\times$, and correctly executed actions with $\checkmark$. By generating a frame-level action plan prior to motion synthesis, ActionPlan faithfully executes all specified actions in the correct order, while baselines frequently miss or misorder key actions. See Supp. for videos.
Figure 5: ActionPlan supports diverse downstream applications zero-shot. Darker shading indicates later time steps. Editing (top): regenerates selected latents conditioned on a new prompt (green) while preserving others. Long motion streaming (middle): generates coherent long-horizon motion in successive chunks across prompts. In-betweening (bottom): given fixed start (white) and end (dark grey) poses, fills in the intermediate motion. See Supp. for video results.
...and 1 more figures
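
The step count in the Figure 2 caption, $N+T-1$ total steps for $N$ motion tokens and $T$ flow matching steps, can be illustrated with a small sketch. It assumes a staggered (pipeline-style) schedule in which token $i$ starts denoising one global step after token $i-1$ and each token receives exactly $T$ updates. This is one schedule consistent with the stated count, not necessarily the exact one used in the paper; the function names are hypothetical and the action plan latents are omitted for brevity.

```python
def streaming_schedule(num_tokens, num_steps):
    """List, for each global step, the (token_index, local_step) pairs updated.

    Assumption: token i is active at global step s when 0 <= s - i < num_steps,
    so consecutive tokens are offset by exactly one global step.
    """
    total_steps = num_tokens + num_steps - 1
    schedule = []
    for s in range(total_steps):
        active = [
            (i, s - i)
            for i in range(num_tokens)
            if 0 <= s - i < num_steps
        ]
        schedule.append(active)
    return schedule


def finish_step(token_index, num_steps):
    """Zero-based global step at which a token receives its final update."""
    return token_index + num_steps - 1


if __name__ == "__main__":
    # Print the staggered schedule for a small example.
    for step, active in enumerate(streaming_schedule(num_tokens=4, num_steps=3)):
        print(step, active)
```

Under this assumption, the first token is fully denoised after $T$ global steps and each subsequent token completes one step later, so the last token finishes at zero-based step $N+T-2$. Denoising every token to completion before starting the next would instead take $N \cdot T$ steps.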
