DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies

Haocheng Yuan; Adrien Bousseau; Hao Pan; Lei Zhong; Changjian Li

Abstract

Creating compelling 3D character animations typically requires either expert use of professional software or expensive motion capture systems operated by skilled actors. We present DancingBox, a lightweight, vision-based system that makes motion capture accessible to novices by reimagining the process as digital puppetry. Instead of tracking precise human motions, DancingBox captures the approximate movements of everyday objects manipulated by users with a single webcam. These coarse proxy motions are then refined into realistic character animations by conditioning a generative motion model on bounding-box representations, enriched with human motion priors learned from large-scale datasets. To overcome the lack of paired proxy-animation data, we synthesize training pairs by converting existing motion capture sequences into proxy representations. A user study demonstrates that DancingBox enables intuitive and creative character animation using diverse proxies, from plush toys to bananas, lowering the barrier to entry for novice animators.
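The conversion of motion capture sequences into proxy representations can be illustrated with a minimal sketch. It assumes each proxy part is an axis-aligned bounding box around a group of skeleton joints. The abstract does not specify the joint grouping, the padding, or the use of axis-aligned rather than oriented boxes, so `PARTS`, `padding`, and `joints_to_boxes` are hypothetical names chosen for illustration only.

```python
import numpy as np

# Hypothetical grouping of joint indices into proxy parts (not from the paper).
PARTS = {"torso": [0, 1, 2], "left_arm": [3, 4], "right_arm": [5, 6]}

# The 8 corners of a unit cube, used to interpolate between box min and max.
CORNER_SIGNS = np.array(
    [[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float
)

def joints_to_boxes(motion, parts, padding=0.05):
    """Convert joint positions into per-part box corners.

    motion: array of shape (T, J, 3) with T frames and J joints.
    Returns an array of shape (T, P, 8, 3) for P parts.
    """
    boxes = []
    for joint_ids in parts.values():
        pts = motion[:, joint_ids, :]        # (T, k, 3)
        lo = pts.min(axis=1) - padding       # (T, 3) lower box corner
        hi = pts.max(axis=1) + padding       # (T, 3) upper box corner
        # Each corner picks lo or hi per axis according to CORNER_SIGNS.
        corners = lo[:, None, :] + CORNER_SIGNS[None] * (hi - lo)[:, None, :]
        boxes.append(corners)                # (T, 8, 3)
    return np.stack(boxes, axis=1)           # (T, P, 8, 3)

# Synthetic stand-in for a motion capture clip: 4 frames, 7 joints.
motion = np.random.default_rng(0).normal(size=(4, 7, 3))
boxes = joints_to_boxes(motion, PARTS)
```

Under the same assumption, the function could also be applied to the 3D points of a segmented proxy part, which is what makes boxes a convenient shared representation for both data sources.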

Paper Structure (30 sections, 1 equation, 9 figures, 1 table)

This paper contains 30 sections, 1 equation, 9 figures, 1 table.

Introduction
Related Work
Character Animation with Physical Proxies
Vision Foundation Models
Motion Generative Models
Method
System Setup and User Input
Motion Capturing with Vision Foundation Models
Box-guided Motion Generation
User Experience Study
Study Methodology
Study participants
Experimental setup
Study protocol
System Usability and Result Motion Quality
...and 15 more sections

Figures (9)

Figure 1: Our lightweight motion capture module produces noisy, partial point clouds, which we need to relate to the virtual skeletons used to represent character animation datasets. Extracting a clean skeleton from a point cloud, or synthesizing a defect-laden point cloud from a skeleton, are two difficult tasks. Our key observation is that abstract bounding boxes form a suitable middle-ground representation as they are easy to extract from both point cloud and skeleton data.
Figure 2: Overview of MoCap. From left to right: given the recorded video and user clicks in the first frame, we exploit SAM2 [ravi2024sam] to segment the parts of the puppet in all frames. The user clicks indicate the desired part segmentation, with different colors denoting different parts. The order of the clicks does not influence the result. We also run $\pi^3$ [wang2025pi] to estimate the point cloud of each frame, and CoTracker3 [karaev2024cotracker3] to produce dense pixel-wise correspondences between frames. Combining semantic segments, point clouds, and motion tracks allows us to recover 3D bounding boxes of proxy parts and their motion across the video clip.
Figure 3: Left: overview of our conditional motion generator. We train a custom box motion encoder alongside a ControlNet module to condition a pre-trained Motion Diffusion Model. Right: for a given frame, our box motion encoder first encodes each vertex of a box (e.g., the pink one) using an MLP, and aggregates all vertices of the box into a single code using mean and max operations to obtain a latent code that is invariant to vertex ordering. A self-attention layer then exchanges information between the latent codes of all boxes of the character proxy. Finally, the resulting latent codes are aggregated into a single code for the entire character, again using mean and max to be invariant to box ordering.
Figure 4: Impact and design of spatial guidance. Left: without spatial guidance, the bounding boxes provide rough motion control but do not guarantee precise alignment between boxes and joints (e.g., the joints on the right leg vs. the red box). Middle: for an exemplar bounding box (in red), the guidance measurement loss is computed against every joint, then accumulated using distance-based combination weights. Five such distances are visualized, with colors indicating the corresponding weights. Right: with the designed spatial guidance, the generated motion aligns closely with the bounding boxes, ensuring each box contains at least one joint.
Figure 5: Representative replication results from our user study. Given a target motion (right), participants were asked to reproduce it (middle) by manipulating a designated physical proxy (left).
...and 4 more figures
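
The order-invariant box encoding described in the Figure 3 caption can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the latent width `D`, the two-layer MLP, the single-head residual attention, and the random weights standing in for trained parameters are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent width (hypothetical)

# Random weights stand in for trained parameters (assumption).
W1 = rng.normal(size=(3, D)) * 0.5
W2 = rng.normal(size=(D, D)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(2 * D, 2 * D)) / np.sqrt(2 * D) for _ in range(3))

def vertex_mlp(v):
    """Two-layer ReLU MLP applied independently to every 3D vertex."""
    return np.maximum(v @ W1, 0.0) @ W2

def pool(x, axis):
    """Concatenate mean and max along `axis`; both ignore element order."""
    return np.concatenate([x.mean(axis=axis), x.max(axis=axis)], axis=-1)

def self_attention(h):
    """Single-head self-attention with a residual connection.

    Without positional encodings this is permutation-equivariant over boxes.
    """
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(h.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return h + weights @ v

def encode_frame(boxes):
    """boxes: (num_boxes, 8, 3) corner positions -> (4 * D,) character code."""
    per_vertex = vertex_mlp(boxes)       # (B, 8, D)
    per_box = pool(per_vertex, axis=1)   # (B, 2D), invariant to vertex order
    mixed = self_attention(per_box)      # (B, 2D), boxes exchange information
    return pool(mixed, axis=0)           # (4D,), invariant to box order

boxes = rng.normal(size=(5, 8, 3))
code = encode_frame(boxes)
```

Because mean and max ignore ordering and the attention step only permutes its outputs when its inputs are permuted, shuffling either the vertices within a box or the boxes themselves leaves `code` unchanged up to floating-point error.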
