AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models

Kwan Yun, Seokhyeon Hong, Chaelin Kim, Junyong Noh

TL;DR

AnyMoLe introduces a data-efficient approach to motion in-betweening by leveraging video diffusion models to synthesize intermediate frames for arbitrary characters without external data. It combines ICAdapt domain adaptation, a two-stage video generation process, and motion-video mimicking with a scene-specific 3D joint estimator to produce smooth, 3D-consistent motions. Across humanoid and non-humanoid characters, it outperforms baselines on quantitative metrics and is validated by a user study, demonstrating practical utility for broad animation tasks. The work acknowledges runtime and ambiguity challenges, proposing future directions toward faster, context-aware, character-agnostic 3D pose estimation to enhance robustness.

Abstract

Despite recent advancements in learning-based motion in-betweening, a key limitation has been overlooked: the requirement for character-specific datasets. In this work, we introduce AnyMoLe, a novel method that addresses this limitation by leveraging video diffusion models to generate motion in-between frames for arbitrary characters without external data. Our approach employs a two-stage frame generation process to enhance contextual understanding. Furthermore, to bridge the domain gap between real-world and rendered character animations, we introduce ICAdapt, a fine-tuning technique for video diffusion models. Additionally, we propose a "motion-video mimicking" optimization technique, enabling seamless motion generation for characters with arbitrary joint structures using 2D and 3D-aware features. AnyMoLe significantly reduces data dependency while generating smooth and realistic transitions, making it applicable to a wide range of motion in-betweening tasks.

Paper Structure

This paper contains 20 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: AnyMoLe generates in-between motion from context frames and keyframes without requiring external training data.
  • Figure 2: Overview of AnyMoLe: First, the video diffusion model is fine-tuned without using any external data (Sec. \ref{subsec:ICAdapt}) while the scene-specific joint estimator is trained (Sec. \ref{subsec:pose}). Next, the fine-tuned video generation model produces an in-between video (Sec. \ref{subsec:videogen}), which is then refined through motion-video mimicking to generate the final in-between motion (Sec. \ref{subsec:mimicking}).
  • Figure 3: Overview of the ICAdapt training process. The spatial module and image injection module are trained, while the others are frozen.
  • Figure 4: Context-frame-guided video generation process.
  • Figure 5: Two-stage inference of $D_{adp}$. First, in the coarse stage, a low-frame-rate video is generated in an autoregressive manner. Next, a high-frame-rate video is generated from the low-frame-rate video.
  • ...and 5 more figures