ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara

TL;DR

ProjFlow is a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. For sparse inputs, it adds a time-varying formulation that uses pseudo-observations which fade during sampling.

Abstract

Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, including motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.

Paper Structure (43 sections, 45 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of the Projection Sampling Step. At each timestep $t$: (1) predict the clean endpoint $\hat{{\bm{x}}}_1$ from ${\bm{x}}_t$ using the learned velocity $v_\theta({\bm{x}}_t,t)$; (2) enforce the linear–Gaussian measurements ${\bm{y}}=A{\bm{x}}+{\bm{\epsilon}}$ by computing a correction $\Delta {\bm{x}}_1^\star$ that projects $\hat{{\bm{x}}}_1$ to the measurement set under the kinematics-aware metric $R$. This metric encodes skeletal topology and spreads updates coherently along the kinematic tree. The measurement covariance $\Sigma$ modulates the pull toward the observations; smaller values yield stronger attraction and recover hard constraints as $\Sigma \to 0$. (3) Finally, stochastically recompose the corrected endpoint to obtain the next state ${\bm{x}}_{t+\Delta t}$.
  • Figure 2: Pseudo-observations for motion inpainting. Sparse observations are interpolated to guide intermediate frames. This guidance is controlled by two mechanisms: Dynamic Masking activates a time-scheduled neighborhood, and Adaptive Variance treats original observations as hard constraints and the interpolated guides as soft constraints.
  • Figure 3: Text-conditioned pelvis-trajectory control. Given the prompt "a person runs forward in an S-shaped path" and a pelvis control signal, we compare OmniControl [xie2024omnicontrol], MaskControl [pinyoanuntapong2024controlmm], and ProjFlow (ours). The rendered motions and the trajectory plots both visualize the generated pelvis trajectory (orange) overlaid on the target control signal (gray dotted line).
  • Figure 4: 2D-to-3D hand-trajectory lifting with text conditioning. The input condition includes the text prompt "a person draws a heart with their hand while walking," an initial 2D keypose, and a left-wrist 2D trajectory shaped like a heart. Sketch2Anim [zhong2025sketch2anim] fails to reproduce the heart path precisely: the shape collapses, and the subject does not exhibit walking motion. In contrast, ProjFlow follows the heart-shaped wrist trajectory accurately while maintaining a natural walking motion throughout the sequence.
  • Figure 5: Average inference time per 196-frame sample.
  • ...and 1 more figure
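
The three steps in the Figure 1 caption can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a linear interpolation path $x_t = (1-t)\,\text{noise} + t\,x_1$, so the endpoint estimate is $x_t + (1-t)\,v$. It assumes the correction minimizes $\tfrac12 \Delta^\top R \Delta$ plus the $\Sigma^{-1}$-weighted measurement residual, which gives $\Delta = R^{-1}A^\top (A R^{-1} A^\top + \Sigma)^{-1}(y - A\hat{x}_1)$ and remains well defined as $\Sigma \to 0$. It also assumes recomposition with fresh Gaussian noise. The chain-Laplacian metric, the toy velocity field, and all function names are hypothetical stand-ins.

```python
import numpy as np


def predict_endpoint(x_t, t, velocity):
    # Step 1 (assumed linear path): extrapolate along the velocity to t = 1.
    return x_t + (1.0 - t) * velocity


def metric_correction(x1_hat, A, y, R, Sigma):
    # Step 2: closed-form minimizer of the assumed objective.
    # With Sigma = 0 the corrected endpoint satisfies A x = y exactly.
    R_inv_At = np.linalg.solve(R, A.T)   # R^{-1} A^T
    S = A @ R_inv_At + Sigma             # A R^{-1} A^T + Sigma
    return R_inv_At @ np.linalg.solve(S, y - A @ x1_hat)


def recompose(x1_corrected, t_next, rng):
    # Step 3 (assumed form): re-noise the corrected endpoint to time t_next.
    noise = rng.standard_normal(x1_corrected.shape)
    return t_next * x1_corrected + (1.0 - t_next) * noise


def projection_sampling_step(x_t, t, dt, velocity_fn, A, y, R, Sigma, rng):
    x1_hat = predict_endpoint(x_t, t, velocity_fn(x_t, t))
    x1_corrected = x1_hat + metric_correction(x1_hat, A, y, R, Sigma)
    return recompose(x1_corrected, t + dt, rng), x1_corrected


def toy_velocity(x, t):
    # Hypothetical stand-in for the learned velocity field v_theta.
    return -x


# Toy setup: 6 scalar "joints" on a chain; the first two are observed.
rng = np.random.default_rng(0)
n = 6
laplacian = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
laplacian[0, 0] = laplacian[-1, -1] = 1.0
R = np.eye(n) + 5.0 * laplacian   # illustrative topology-aware metric (SPD)
A = np.eye(n)[:2]                 # selection matrix for the observed joints
y = np.array([1.0, 0.5])          # target values for the observed joints
Sigma = np.zeros((2, 2))          # zero covariance: hard constraints

x_t = rng.standard_normal(n)
x_next, x1_corrected = projection_sampling_step(
    x_t, 0.5, 0.1, toy_velocity, A, y, R, Sigma, rng
)
constraint_error = np.max(np.abs(A @ x1_corrected - y))
print("max constraint error:", constraint_error)
```

In this toy setup the Laplacian term penalizes corrections that differ between neighboring joints, so unobserved joints move together with the observed ones instead of being left untouched. Increasing the entries of `Sigma` weakens the pull toward `y`, matching the caption's description of soft versus hard constraints.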