Generative Motion Infilling From Imprecisely Timed Keyframes

Purvi Goel, Haotian Zhang, C. Karen Liu, Kayvon Fatahalian

TL;DR

This work tackles generating realistic motion from keyframes that may be imprecisely timed. It proposes a diffusion-based model with a dual output that learns a global time warp and local pose residuals, enabling retiming of constraints while adding detailed motion, trained on synthetic mistimed keyframe data. The approach yields well-timed, diverse, and pose-faithful motion for both synthesis and editing tasks, outperforming baselines that enforce hard timing or rely solely on spatial refinement. By enabling loose timing constraints, the method offers a flexible, animator-friendly workflow, with potential integration of physics-based postprocessing in future work.

Abstract

Keyframes are a standard representation for kinematic motion specification. Recent learned motion-inbetweening methods use keyframes as a way to control generative motion models, and are trained to generate life-like motion that matches the exact poses and timings of input keyframes. However, the quality of generated motion may degrade if the timing of these constraints is not perfectly consistent with the desired motion. Unfortunately, correctly specifying keyframe timings is a tedious and challenging task in practice. Our goal is to create a system that synthesizes high-quality motion from keyframes, even if keyframes are imprecisely timed. We present a method that allows constraints to be retimed as part of the generation process. Specifically, we introduce a novel model architecture that explicitly outputs a time-warping function to correct mistimed keyframes, and spatial residuals that add pose details. We demonstrate how our method can automatically turn approximately timed keyframe constraints into diverse, realistic motions with plausible timing and detailed submovements.

Paper Structure

This paper contains 27 sections, 6 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: System: Our method for motion infilling with loose timing control accommodates motion synthesis (left) and motion editing (right). In the motion synthesis workflow, the animator provides a set of keyposes and approximately when these events occur on the timeline (left, top). The union of constrained and unconstrained regions forms the observation signal $\mathbf{X}$, which our method converts into detailed, high-fidelity motion $\mathbf{Y}$ (left, bottom). In the motion editing workflow, the animator starts with an existing high-fidelity motion (right, top) and specifies an edit by providing a new keypose (right, top: pink dot). This results in an observation signal $\mathbf{X}$ comprising context from the original motion and the new keypose (right, middle). Then our method converts $\mathbf{X}$ into $\mathbf{Y}$ by adding pose and timing detail (right, bottom).
  • Figure 2: Data Collection: We synthetically generate plausible mistimed $\mathbf{X}$ from detailed motion clips $\mathbf{Y}$ (left, top). For each detailed motion sequence $\mathbf{Y}$, we first identify poses that could plausibly have served as keyframes for $\mathbf{Y}$. We select one at random and simulate approximate timing by temporally shifting it by a small integer, which produces $\mathbf{X}_{k+\Delta k}$. We delete a window of neighboring frames (right, top). The submotions outside the deleted window, and $\mathbf{X}_{k+\Delta k}$, form observation signal $\mathbf{X}$ (right, bottom).
  • Figure 3: Diffusion Model Architecture: During training, our two-headed model U (left) learns to predict both a time warp $\mathbf{w}$ and pose details $\Delta \mathbf{X}$ from a shared transformer decoder backbone, given observation signal $\mathbf{X}$ (preprocessed so that all undefined regions are replaced with an interpolation solution), diffusion timestep $t$, and a noisy sequence $\mathbf{Y}^{t}$. $\mathbf{w}$ is applied to $\mathbf{X}$ as a global retiming operation, then summed with $\Delta \mathbf{X}$ as a pose detailing operation. At inference (right), U iteratively denoises the sequence from $t=T$ to $t=0$. We use $\cup$ to represent the "flatten" operator.
  • Figure 4: Motion synthesis: starting from approximately timed keyframe constraints (top row, red) of a character raising and grabbing its right leg, our model generates detailed motion $\mathbf{Y}$; we show two generations here (middle row, bottom row). Our model can capture different modes of motion with different seeds. One seed produces $\mathbf{Y}$ (middle row) where the character loses its balance, then recovers. Another seed produces $\mathbf{Y}$ (bottom row) where the character expertly grabs its leg and pivots. In the latter case, notice how the middle keyframe (top row, second red pose) appears a little later in the generated motion (bottom row).
  • Figure 5: Motion Editing: Given an existing motion (top) of a character kicking with the right leg, the animator wants to make the character kick a second time. The animator creates a new pose (middle row, red) where the character kicks again, and places it at approximately the right place on the timeline: twenty frames after the first kick. Given this input, our model generates detailed motion $\mathbf{Y}$ (bottom).
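
The data-collection procedure described in Figure 2 can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `make_mistimed_observation`, the parameters `max_shift` and `window`, the zero placeholder for deleted frames, and the choice to delete a window that spans both the original and the shifted keyframe position are all assumptions. The caption does not say how plausible keyframes are identified, so the keyframe index `k` is simply passed in.

```python
import numpy as np

def make_mistimed_observation(Y, k, max_shift=5, window=10, rng=None):
    """Build an observation X from a detailed clip Y of shape (frames, features).

    k is the index of a frame assumed to be a plausible keyframe of Y.
    Returns (X, defined, new_k); `defined` marks which frames of X carry data.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]

    # Simulate approximate timing: shift the keyframe by a small integer
    # (the range is an assumption; a shift of zero is allowed here).
    dk = int(rng.integers(-max_shift, max_shift + 1))
    new_k = int(np.clip(k + dk, 0, n - 1))

    X = Y.copy()
    defined = np.ones(n, dtype=bool)

    # Delete a window of neighboring frames around both keyframe positions.
    lo = max(0, min(k, new_k) - window)
    hi = min(n, max(k, new_k) + window + 1)
    X[lo:hi] = 0.0
    defined[lo:hi] = False

    # Place the original key pose at its mistimed location.
    X[new_k] = Y[k]
    defined[new_k] = True
    return X, defined, new_k
```

The frames outside the deleted window stay untouched and, together with the shifted key pose, play the role of the observation signal $\mathbf{X}$; the original clip $\mathbf{Y}$ would serve as the training target.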
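
Figure 3 describes the model output as a composition: the time warp $\mathbf{w}$ retimes $\mathbf{X}$ globally, and the residual $\Delta \mathbf{X}$ is then added for pose detail. A minimal sketch of that composition is shown below, under several assumptions that the captions do not confirm: $\mathbf{w}$ is represented as one fractional source-frame index per output frame, poses are stored as plain feature vectors that can be linearly interpolated (real rotation data would likely need a rotation-aware interpolation), and the helper names `fill_undefined`, `apply_time_warp`, and `compose` are hypothetical.

```python
import numpy as np

def fill_undefined(X, defined):
    """Replace undefined frames with a per-channel linear interpolation.

    Assumes at least one frame is defined.
    """
    frames = np.arange(X.shape[0])
    filled = X.copy()
    for c in range(X.shape[1]):
        filled[:, c] = np.interp(frames, frames[defined], X[defined, c])
    return filled

def apply_time_warp(X, w):
    """Resample X at the fractional source frames in w (one per output frame)."""
    frames = np.arange(X.shape[0])
    w = np.clip(w, 0, X.shape[0] - 1)
    channels = [np.interp(w, frames, X[:, c]) for c in range(X.shape[1])]
    return np.stack(channels, axis=1)

def compose(X, w, delta_X):
    """Global retiming followed by additive pose detail."""
    return apply_time_warp(X, w) + delta_X
```

With an identity warp (`w = np.arange(frames)`) and a zero residual, `compose` returns the interpolated observation unchanged, which makes the separate roles of the two outputs easy to see: `w` only moves content along the timeline, while `delta_X` only changes poses.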