Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging
Jia-peng Zhang, Cheng-Feng Pu, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu
TL;DR
This work reframes rigging as a representation problem by introducing SkinTokens, a discrete, learned compression of skinning weights obtained with an FSQ-CVAE. The representation enables a unified autoregressive model, TokenRig, that jointly generates skeletons and skinning. The approach is further strengthened by a reinforcement learning refinement (GRPO) with tailored rewards to improve generalization to complex, out-of-distribution assets. Empirical results show substantial improvements in skinning fidelity and skeletal accuracy over state-of-the-art methods, with strong compression of skinning data and better local predictability of influence maps. The framework offers a scalable, end-to-end generative pipeline for high-fidelity, robust rigging across diverse 3D characters, with clear avenues for extending the discrete representation and incorporating user guidance.
Abstract
The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit that this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, learning the complex dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation leads to a 98%-133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, improves bone prediction by 17%-22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.
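To make the move from continuous regression to token prediction concrete, the following is a minimal sketch of finite scalar quantization (FSQ), the quantizer family named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the level counts, the restriction to odd level counts, and the function names `fsq_quantize`, `codes_to_token`, and `token_to_codes` are all hypothetical, and the conditional VAE encoder and decoder that would surround the quantizer are omitted.

```python
import math

def fsq_quantize(z, levels):
    """Snap each latent coordinate to one of levels[i] integer values.

    Assumption: every entry of `levels` is odd, so the grid is
    symmetric around zero and no half-step offset is needed.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half   # squash into [-half, half]
        codes.append(int(round(bounded)))   # nearest integer level
    return codes

def codes_to_token(codes, levels):
    """Pack per-dimension codes into one token id (mixed-radix encoding)."""
    token, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        token += (code + half) * base       # shift code into [0, num_levels)
        base *= num_levels
    return token

def token_to_codes(token, levels):
    """Invert codes_to_token, recovering the per-dimension codes."""
    codes = []
    for num_levels in levels:
        half = (num_levels - 1) // 2
        codes.append(token % num_levels - half)
        token //= num_levels
    return codes

# Hypothetical 3-dimensional latent with 5 * 5 * 3 = 75 possible tokens.
levels = [5, 5, 3]
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_token(codes, levels)
```

Because the codebook is an implicit grid rather than a learned embedding table, each quantized latent maps to a single integer id, which is the kind of discrete symbol an autoregressive model can predict alongside skeletal parameters.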
