Semantics-aware Motion Retargeting with Vision-Language Models

Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong

TL;DR

This work tackles the preservation of motion semantics during retargeting across characters by using vision-language models as semantic supervisors. It introduces Semantics-aware Motion reTargeting (SMT), a two-stage framework that first learns skeleton-level motion via graph-based encoders/decoders and then fine-tunes the network by aligning semantics with a frozen vision-language model, using a semantics consistency loss $\mathcal{L}_{sem}$ and a geometry penalty $\mathcal{L}_{pen}$. Differentiable skinning and multi-view rendering enable image-domain supervision, with the latent semantic embeddings $\mathbf{E}_A, \mathbf{E}_B$ obtained through guiding visual question answering. Empirical results on Mixamo show that SMT achieves state-of-the-art motion quality and semantics preservation, reduces interpenetration, improves semantic alignment, and extends to retargeting from real human videos.

Abstract

Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most previous works neglect semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. The high-level motion semantics are then incorporated into the motion retargeting process by feeding the rendered images to the vision-language model and aligning the extracted semantic embeddings. To ensure the preservation of both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics.
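
To make the fine-tuning objective described above more concrete, the following is a minimal sketch of how a semantics consistency term and a geometry penalty might be combined. It is not the paper's implementation: the cosine-distance form of the semantic term, the representation of interpenetration as signed penetration depths (positive when a vertex lies inside another body part), the weights `w_sem` and `w_pen`, and all function names are assumptions made for illustration only.

```python
import math


def cosine_distance(e_a, e_b):
    """Return 1 - cosine similarity of two equal-length embedding vectors.

    Assumption: the semantic term compares embeddings by cosine distance;
    the summary above only says the embeddings are aligned.
    """
    if len(e_a) != len(e_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(e_a, e_b))
    norm_a = math.sqrt(sum(a * a for a in e_a))
    norm_b = math.sqrt(sum(b * b for b in e_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def finetune_objective(e_src, e_tgt, penetration_depths, w_sem=1.0, w_pen=1.0):
    """Combine a semantic term and a geometry penalty into one scalar.

    e_src, e_tgt: hypothetical stand-ins for the embeddings E_A and E_B.
    penetration_depths: hypothetical signed depths, positive inside the body.
    w_sem, w_pen: hypothetical loss weights.
    """
    l_sem = cosine_distance(e_src, e_tgt)
    # Only vertices that actually penetrate (depth > 0) are penalised.
    clipped = [max(d, 0.0) for d in penetration_depths]
    l_pen = sum(clipped) / max(len(clipped), 1)
    return w_sem * l_sem + w_pen * l_pen, l_sem, l_pen
```

In this sketch, identical source and target embeddings give a semantic term of zero, so the total reduces to the weighted geometry penalty. In an actual training setup both terms would need to be differentiable with respect to the network output, which is the role the differentiable skinning and rendering modules play in the method.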

Paper Structure (19 sections, 16 equations, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Comparison with previous motion retargeting methods. (a) Previous works rely on a human-designed joint distance matrix [zhang2023skinned] or self-contacts between mesh vertices [villegas2021contact] to ensure semantics preservation. (b) Our work enforces human-level motion semantics consistency with the extensive knowledge of vision-language models. (c) Comparison of motion quality and semantics preservation on the Mixamo dataset [Mixamo]. Our method achieves the best motion quality and semantics consistency.
  • Figure 2: Model Architecture. Our semantics-aware motion retargeting framework employs a two-stage pipeline. Initially, the retargeting network, consisting of multiple spatial-temporal graph convolution layers, is trained at the skeletal level to establish a base model. Subsequently, this model is fine-tuned at the semantic level by aligning the latent semantic embeddings of the source and target, leveraging the extensive knowledge of vision-language models. The latent semantic embedding is extracted by guiding visual question answering. Geometry constraints are also enforced during fine-tuning to avoid interpenetration.
  • Figure 3: An example of guiding visual question answering.
  • Figure 4: Qualitative comparison. The results demonstrate that our method can effectively preserve semantics, while the baseline methods suffer from interpenetration or semantic information loss. From the first column to the last: the source motion, the Copy strategy, NKN [villegas2018neural], SAN [aberman2020skeleton], R2ET [zhang2023skinned], our method, and text descriptions.
  • Figure 5: Qualitative comparison for the ablation study among the network without fine-tuning (TWS), the network trained with only semantics and geometry fine-tuning (TWF), the network trained with all loss functions (TWA), the network fine-tuned with only the interpenetration loss (FWP), and our full model (All).
  • ...and 14 more figures