Skin Tokens: A Learned Compact Representation for Unified Autoregressive Rigging

Jia-peng Zhang, Cheng-Feng Pu, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu

TL;DR

This work reframes rigging as a representation problem by introducing SkinTokens, a discrete, learned compression of skinning weights via FSQ-CVAE, enabling a unified autoregressive model TokenRig that jointly generates skeletons and skinning. The approach is further strengthened by a reinforcement learning refinement (GRPO) with tailored rewards to improve generalization to complex, out-of-distribution assets. Empirical results show substantial improvements in skinning fidelity and skeletal accuracy over state-of-the-art methods, with strong compression of skinning data and better local predictability of influence maps. The framework offers a scalable, end-to-end generative pipeline for high-fidelity, robust rigging across diverse 3D characters, with clear avenues for extending the discrete representation and incorporating user guidance.

Abstract

The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, learning the complicated dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation leads to a 98%-133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, enhances bone prediction by 17%-22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.

Paper Structure (45 sections, 10 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Overview of the TokenRig Framework. Our method consists of three key stages: (1) Learning SkinTokens (Section \ref{sec:FSQ-CVAE}): We first train an FSQ-CVAE [kingma2013auto, sohn2015learning, mentzer2023finite] to compress sparse skinning weights into a compact, discrete representation. Mesh geometry and skinning weights are processed by VecSet [zhang20233dshape2vecset] encoders, and the resulting features are discretized into SkinTokens via Finite Scalar Quantization (FSQ) [mentzer2023finite]. We employ nested dropout [bachmann2025flextok, rippel2014learning] and importance sampling to ensure robust reconstruction of active deformation regions. (2) Unified Autoregressive Modeling (Section \ref{sec:method-generation}): We formulate rigging as a sequence generation task. A Transformer generates a single, unified sequence comprising the complete skeleton followed by the learned SkinTokens (from Stage 1), conditioned on global shape embeddings to capture structural dependencies. (3) RL Refinement via GRPO (Section \ref{sec:GRPO}): To improve generalization to in-the-wild assets, we fine-tune the model using Group Relative Policy Optimization (GRPO) [liu2024deepseek]. We introduce four specific rewards: Volumetric Joint Coverage (ensuring bone distribution), Bone-Mesh Containment (preventing protrusion), Skinning Coverage and Sparsity (ensuring valid weighting), and Deformation Smoothness (preventing artifacts during animation).
  • Figure 2: Gradient Analysis of Loss Functions. A comparison of the Binary Cross Entropy (BCE) and Dice [sudre2017generalised] loss landscapes for a target weight $w=0.2$. While both are minimized at the correct value, Dice loss provides significantly larger gradients for non-zero targets ($w_{\text{pred}} \in [0,1]$), effectively counteracting the extreme sparsity of skinning matrices where BCE gradients tend to vanish.
  • Figure 3: SkinTokens Reconstruction Fidelity. We evaluate the reconstruction quality (IoU and $L_1$ error) of the FSQ-CVAE across varying codebook sizes $C$ and token sequence lengths $T_D$. The results demonstrate that SkinTokens achieve high fidelity with as few as $4$ tokens, validating the compressibility of skinning data. The configuration $C = [8,8,8,6,5]$, i.e. $8 \cdot 8 \cdot 8 \cdot 6 \cdot 5 = 15{,}360$ codes (lines with circles), is selected for our final model for its superior balance of compression and accuracy. The figure reports the IoU scores at $\varepsilon = 10^{-2}$ and the corresponding $L_1$ reconstruction errors on the Articulation 2.0 [song2025magicarticulate] test dataset.
  • Figure 4: Learned Semantics of SkinTokens. A t-SNE visualization of the continuous latent vectors $L_W$ prior to quantization, sampled from $300$ instances in the VRoid dataset [isozaki2021vroid]. Points are colored by bone category (e.g., Head, Hips). The clear emergence of anatomical clusters indicates that the encoder captures a semantic structural prior, learning to represent "body part concepts" invariant to specific mesh geometries.
  • Figure 5: Qualitative Comparison of Skeleton Generation. We compare TokenRig (Ours) against state-of-the-art baselines. While baseline methods exhibit partial structures, missing details, or redundant joints, our method synthesizes structurally coherent and semantically faithful skeletons across diverse character types.
  • ...and 4 more figures
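
The Figure 3 caption gives the FSQ configuration as per-dimension level counts [8,8,8,6,5] with 15,360 codes in total. The following is a minimal sketch of how such a configuration can map a continuous latent vector to a single discrete token index. It is an illustration under stated assumptions, not the authors' implementation: the tanh bounding, the rounding scheme, and the names `fsq_quantize`, `codes_to_index`, and `index_to_codes` are hypothetical simplifications of Finite Scalar Quantization, and only the level counts come from the caption.

```python
import numpy as np

# Per-dimension level counts, taken from the Figure 3 caption.
LEVELS = np.array([8, 8, 8, 6, 5])


def codebook_size(levels=LEVELS):
    """Number of distinct codes: the product of the per-dimension levels."""
    return int(np.prod(levels))


def _basis(levels):
    # Mixed-radix place values: [1, L0, L0*L1, ...].
    return np.concatenate(([1], np.cumprod(levels[:-1])))


def fsq_quantize(z, levels=LEVELS):
    """Bound each latent dimension to (0, 1) and round it to one of L_i levels.

    Assumption: a simplified bounding via tanh; the published FSQ scheme
    treats even level counts with an extra half-step offset.
    """
    z = np.asarray(z, dtype=np.float64)
    unit = (np.tanh(z) + 1.0) / 2.0
    return np.rint(unit * (levels - 1)).astype(np.int64)


def codes_to_index(codes, levels=LEVELS):
    """Pack per-dimension integer codes into one token index."""
    return (np.asarray(codes) * _basis(levels)).sum(axis=-1)


def index_to_codes(index, levels=LEVELS):
    """Unpack a token index back into per-dimension integer codes."""
    return (np.asarray(index)[..., None] // _basis(levels)) % levels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(4, len(LEVELS)))  # four hypothetical latent vectors
    codes = fsq_quantize(latents)
    tokens = codes_to_index(codes)
    print("codebook size:", codebook_size())  # 15360
    print("token indices:", tokens)
```

Because the codebook is implicit in the level counts, no embedding table has to be learned for the quantizer itself; each token index is simply a mixed-radix encoding of the rounded latent coordinates.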