MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

Zenghao Chai, Chen Tang, Yongkang Wong, Xulei Yang, Mohan Kankanhalli

TL;DR

MimiCAT tackles category-free 3D pose transfer by learning soft, many-to-many keypoint correspondences with a cascade-transformer framework. A new dataset, PokeAnimDB, with ~4.4 million pose samples across hundreds of characters, supports a learned pose prior and shape-aware deformations, enabling cross-category pose retargeting with semantic keypoint labels. Training proceeds in two stages: correspondence learning with text-guided supervision, followed by cycle-consistent pose refinement guided by the learned prior. The resulting model achieves state-of-the-art results on cross-category transfer and enables downstream text-to-motion transfer to arbitrary characters. The approach broadens the applicability of pose transfer, reduces reliance on humanoid-like morphologies, and provides a resource for general 3D animation and motion synthesis.

Abstract

3D pose transfer aims to transfer the pose style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristics. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. Pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT transfers plausible poses across different characters, significantly outperforming prior methods that are limited to narrow category transfer (e.g., humanoid-to-humanoid).
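To make the idea of soft, many-to-many matching concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a raw affinity matrix between target and source keypoints (random here, learned in the paper), applies Sinkhorn-style alternating normalization in log space, and ends on a row normalization so that each target keypoint holds a convex combination of source keypoints. The function names, the temperature `tau`, the iteration count, and the use of 3-D vectors as a stand-in for per-keypoint transformations are all assumptions made for illustration.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis (keeps dimensions)."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def sinkhorn(affinity, n_iters=20, tau=0.1):
    """Turn raw affinities of shape (num_tgt, num_src) into a soft matching.

    Columns and rows are normalized alternately in log space. The loop ends
    on a row normalization, so every row of the result sums to 1.
    """
    log_p = affinity / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalize columns
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalize rows
    return np.exp(log_p)


rng = np.random.default_rng(0)
num_tgt, num_src = 5, 3  # hypothetical keypoint counts; they need not match

# Stand-in for learned affinity scores between target and source keypoints.
affinity = rng.normal(size=(num_tgt, num_src))
soft_corr = sinkhorn(affinity)

# Stand-in for per-keypoint source quantities (here plain 3-D vectors).
src_feats = rng.normal(size=(num_src, 3))

# Project source quantities onto the target keypoints through the soft matching.
tgt_init = soft_corr @ src_feats
print(soft_corr.round(3))
print(tgt_init.round(3))
```

Because the matrix is rectangular, the sketch does not aim for a doubly stochastic result; it only guarantees that each target keypoint's weights sum to 1. Averaging real rotations this way is not generally valid, which is one reason a refinement stage after the initial projection is natural.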

Paper Structure

This paper contains 28 sections, 11 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: MimiCAT for category-free 3D pose transfer. Given a source character with desired poses (left), our model faithfully transfers the given pose to target characters (right) across completely different categories, proportions, and topologies, without requiring manually labeled correspondences.
  • Figure 2: Overview of MimiCAT for category-free pose transfer. MimiCAT takes a paired source pose and target character as input. It first employs the correspondence transformer $\mathcal{G}$ to estimate soft keypoint correspondences, then refines the initialized transformations using the pose transfer transformer $\mathcal{H}$ to generate the target transformations. Finally, the target character is deformed into the desired pose through linear blend skinning (LBS).
  • Figure 3: Pose examples from PokeAnimDB. Our dataset covers a wide range of species (including humanoids, insects, quadrupeds, fishes, etc.) with high-quality, artist-designed poses.
  • Figure 4: Overview of the correspondence transformer $\mathcal{G}$. We (a) first extract shape and keypoint tokens using the shape projector and keypoint encoder, (b) fuse shape conditions with respective keypoint latents through transformer blocks, (c) estimate correspondences via learnable affinity weights followed by the Sinkhorn algorithm, and (d) produce soft-matching correspondences between the given characters.
  • Figure 5: Overview of the pose transfer transformer $\mathcal{H}$. We (a) first perform cross-attention to extract deformation-aware cues for shape tokenization and apply correspondence-aware initialization for keypoint tokenization. (b) The shape and keypoint tokens are fed into transformer blocks to derive high-level representations, which are decoded into refined target transformations. (c) The posed target mesh is generated by deforming the canonical target through the linear blend skinning (LBS) equation.
  • ...and 9 more figures
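
The captions for Figures 2 and 5 end with the target character being deformed through linear blend skinning (LBS). As a reminder of what that step computes, here is a minimal NumPy sketch of standard LBS, not the paper's code. It assumes per-joint 4x4 transforms that are already expressed relative to the rest pose and skinning weights whose rows sum to 1; the function name and the toy two-joint setup are hypothetical.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, transforms):
    """Deform rest-pose vertices with standard linear blend skinning.

    vertices:   (V, 3) rest-pose positions
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) per-joint transforms relative to the rest pose
    """
    ones = np.ones((len(vertices), 1))
    homo = np.concatenate([vertices, ones], axis=1)          # (V, 4)
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", weights, transforms)  # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, homo)           # (V, 4)
    return posed[:, :3]


# Toy example: joint 0 stays fixed, joint 1 translates by 2 along the y-axis.
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, 1, 3] = 2.0

vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0],   # fully bound to joint 0
                    [0.5, 0.5],   # shared equally between both joints
                    [0.0, 1.0]])  # fully bound to joint 1

posed = linear_blend_skinning(vertices, weights, transforms)
print(posed)  # rows: [0, 0, 0], [1, 1, 0], [2, 2, 0]
```

The middle vertex moves half as far as the fully bound one, which is the blending behavior that lets predicted per-keypoint transformations drive a smooth mesh deformation.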