How Animals Dance (When You're Not Looking)

Xiaojuan Wang, Aleksander Holynski, Brian Curless, Ira Kemelmacher, Steve Seitz

TL;DR

The paper addresses the challenge of generating long, music-synchronized, structured animal dances, which existing models struggle to achieve without extensive animal-specific motion data. It introduces choreography patterns as a high-level control signal, augments a small set of input keyframes with mirrored poses, and employs a graph-optimization framework to map beat-aligned motion segments to keyframe pairs before producing the final video via diffusion and beat-warping. The approach combines choreography pattern extraction from human dances, a directed keyframe-pair graph, mirror-aware keyframe augmentation, and beat-aligned video synthesis to deliver up to 30 seconds of animal dance across many species. A user study and quantitative evaluations show improvements in appearance and visual quality over baselines, highlighting the method's practical potential for entertainment and zoological analysis alike.
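The beat-warping step mentioned above can be pictured as a piecewise-linear time remap. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the function name `warp_to_beats`, its parameters, and the choice of linear interpolation are all hypothetical. It assumes we know the frame index of each keyframe in the generated video and the time in seconds of the beat each keyframe should land on.

```python
import numpy as np

def warp_to_beats(keyframe_idx, beat_times, fps):
    """Return, for every output frame, the (fractional) source frame to sample.

    keyframe_idx: increasing source-frame indices where keyframes occur (assumed known).
    beat_times:   increasing beat times in seconds, one per keyframe (assumed known).
    fps:          output frame rate.
    """
    keyframe_idx = np.asarray(keyframe_idx, dtype=float)
    beat_times = np.asarray(beat_times, dtype=float)
    # Number of output frames needed to reach the last beat.
    n_out = int(round(beat_times[-1] * fps)) + 1
    out_times = np.arange(n_out) / fps
    # Piecewise-linear map: output time -> source frame index,
    # pinned so that each keyframe falls exactly on its beat.
    return np.interp(out_times, beat_times, keyframe_idx)
```

For example, `warp_to_beats([0, 10, 20], [0.0, 0.5, 1.5], fps=10)` plays the first ten source frames in half a second and stretches the next ten over a full second. A real pipeline would still have to resample or blend frames at the fractional indices.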

Abstract

We present a framework for generating music-synchronized, choreography-aware animal dance videos. Our framework introduces choreography patterns -- structured sequences of motion beats that define the long-range structure of a dance -- as a novel high-level control signal for dance video generation. These patterns can be automatically estimated from human dance videos. Starting from a few keyframes representing distinct animal poses, generated via text-to-image prompting or GPT-4o, we formulate dance synthesis as a graph optimization problem that seeks the optimal keyframe structure to satisfy a specified choreography pattern of beats. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce dance videos of up to 30 seconds across a wide range of animals and music tracks.
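To make the graph formulation more concrete, here is a toy sketch of the constraint-satisfaction core. It is an assumption-laden illustration, not the paper's optimizer: the pattern is assumed to be a string of labels such as "ABAB", each label is assigned a distinct directed keyframe pair (start, end), and consecutive segments must chain so that one segment ends on the keyframe where the next begins. The name `assign_pattern` is hypothetical, the search is brute force, and the paper's actual cost terms are not modeled.

```python
from itertools import permutations

def assign_pattern(pattern, num_keyframes):
    """Assign each distinct label in `pattern` a distinct directed keyframe
    pair (i, j), i != j, such that consecutive segments form a continuous walk.

    Returns a dict label -> (start, end), or None if no assignment exists.
    Brute force: only suitable for a handful of keyframes and labels.
    """
    labels = sorted(set(pattern))
    # Every ordered pair of different keyframes is a candidate motion segment.
    edges = [(i, j) for i in range(num_keyframes)
             for j in range(num_keyframes) if i != j]
    for choice in permutations(edges, len(labels)):
        mapping = dict(zip(labels, choice))
        walk = [mapping[label] for label in pattern]
        # A segment must end on the keyframe where the next one starts.
        if all(a[1] == b[0] for a, b in zip(walk, walk[1:])):
            return mapping
    return None
```

With three keyframes, `assign_pattern("ABAB", 3)` pairs A with a move from keyframe 0 to keyframe 1 and B with the return move. Because self-loops are excluded in this toy, a label repeated back to back (for example "AA") has no solution here.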

Paper Structure

This paper contains 17 sections, 9 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: System overview. Given a few initially generated keyframes as input, we generate mirrored counterparts, extract a choreography pattern from a dance video, and optimize the keyframe structure accordingly. The final dance is synthesized by generating in-between frames with a video diffusion model and warping the result to the musical beats. Our method is highlighted in gray.
  • Figure 2: Mirrored pose generation. We fine-tune a text-to-image model with ControlNet using the Canny edges extracted from each keyframe as conditioning. During inference, mirrored pose images are generated by flipping only the subject edges and using an inpainted background edge map composed from the keyframe set.
  • Figure 3: Improving visual consistency by re-generating keyframes with shared background edges.
  • Figure 4: Selected examples from our generated dances. Keyframe pairs are labeled by the choreography pattern label, arranged in the order specified by the choreography pattern. For clarity, we show only a portion of the full sequence here. See supplementary for the complete dance video with music.
  • Figure 5: User ratings of our approach compared to Animate-X on various criteria.
  • ...and 4 more figures
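
The edge-conditioning step described for Figure 2 (flip only the subject edges, keep a shared background edge map) can be sketched with plain array operations. This is a minimal illustration under assumptions: the subject mask is taken as given, the flip is about the image's vertical center line, and `mirrored_edge_map` is a hypothetical helper name. Edge extraction, background inpainting, and the ControlNet model itself are not shown.

```python
import numpy as np

def mirrored_edge_map(edges, subject_mask, background_edges):
    """Build the conditioning edge map for a mirrored pose.

    edges:            H x W edge map of the original keyframe (e.g. 0/1 values).
    subject_mask:     H x W boolean mask of the subject (assumed to be available).
    background_edges: H x W shared background edge map (assumed to be inpainted).
    """
    # Keep only the subject's edges, then flip them left-right.
    flipped_subject = np.fliplr(edges * subject_mask)
    flipped_mask = np.fliplr(subject_mask)
    # Inside the flipped subject region use the flipped edges;
    # everywhere else fall back to the unflipped background edges.
    return np.where(flipped_mask, flipped_subject, background_edges)
```

Leaving the background unflipped is what keeps the scene consistent across a keyframe and its mirror, so that only the animal's pose changes.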