Semantics-aware Motion Retargeting with Vision-Language Models
Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong
TL;DR
This work tackles the preservation of motion semantics during retargeting across characters by using vision-language models as semantic supervisors. It introduces Semantics-aware Motion reTargeting (SMT), a two-stage framework that first learns skeleton-level motion with graph-based encoders and decoders, then fine-tunes the model under semantic alignment with a frozen vision-language model, using a semantics consistency loss $\mathcal{L}_{sem}$ and a geometry penalty $\mathcal{L}_{pen}$. Differentiable skinning and multi-view rendering enable image-domain supervision, with latent semantic embeddings $\mathbf{E}_A, \mathbf{E}_B$ obtained through guiding visual question answering. Empirical results on Mixamo show that SMT achieves state-of-the-art motion quality and semantics preservation, reducing interpenetration and improving semantic alignment, and that it extends to retargeting from real human videos.
Abstract
Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most previous works neglect semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. The high-level motion semantics are then incorporated into the motion retargeting process by feeding the rendered images to the vision-language model and aligning the extracted semantic embeddings. To ensure the preservation of both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics.
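
To make the fine-tuning objective concrete, the following is a minimal sketch of how a semantics term and a geometry term might be combined. It is not the authors' implementation: the mean-squared distance between the embeddings, the use of penetration depths as the geometry penalty, and the names `semantics_consistency_loss`, `geometry_penalty`, `fine_tune_objective`, `w_sem`, and `w_pen` are all assumptions made for illustration. In the actual method the embeddings would come from a frozen vision-language model applied to differentiably rendered images; here they are plain arrays.

```python
import numpy as np


def semantics_consistency_loss(E_A, E_B):
    """Assumed form of L_sem: mean squared distance between the
    source-motion embedding E_A and the retargeted-motion embedding E_B."""
    E_A = np.asarray(E_A, dtype=float)
    E_B = np.asarray(E_B, dtype=float)
    return float(np.mean((E_A - E_B) ** 2))


def geometry_penalty(penetration_depths):
    """Assumed stand-in for L_pen: sum of positive penetration depths.
    Non-positive values (no interpenetration) contribute nothing."""
    depths = np.asarray(penetration_depths, dtype=float)
    return float(np.sum(np.clip(depths, 0.0, None)))


def fine_tune_objective(E_A, E_B, penetration_depths, w_sem=1.0, w_pen=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_sem * semantics_consistency_loss(E_A, E_B)
            + w_pen * geometry_penalty(penetration_depths))


if __name__ == "__main__":
    E_A = [1.0, 0.0]    # placeholder embedding of the source motion
    E_B = [0.0, 0.0]    # placeholder embedding of the retargeted motion
    depths = [-1.0, 0.5]  # one vertex outside the body, one 0.5 inside
    print(fine_tune_objective(E_A, E_B, depths))
```

Under these assumptions, the objective is zero only when the two embeddings coincide and no vertex penetrates the body, which mirrors the stated goal of preserving semantics while reducing interpenetration.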
