SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation

Anbang Wang, Yuzhuo Ao, Shangzhe Wu, Chi-Keung Tang

Abstract

Native 3D generative models have achieved remarkable fidelity and speed, yet they suffer from a critical limitation: they cannot prescribe precise structural articulations, and precise structural control within the native 3D space remains underexplored. This paper proposes SK-Adapter, a simple yet efficient and effective framework that unlocks precise skeletal manipulation for native 3D generation. Moving beyond text or image prompts, which can be ambiguous about precise structure, we treat the 3D skeleton as a first-class control signal. SK-Adapter is a lightweight structural adapter network that encodes joint coordinates and topology into learnable tokens, which are injected into the frozen 3D generation backbone via cross-attention. This design allows the model to "attend" to specific 3D structural constraints while preserving its original generative priors. To bridge the data gap, we contribute Objaverse-TMS, a large-scale dataset of 24k text-mesh-skeleton pairs. Extensive experiments confirm that our method achieves robust structural control while preserving the geometry and texture quality of the foundation model, significantly outperforming existing baselines. Furthermore, we extend this capability to local 3D editing, enabling region-specific editing of existing assets with skeletal guidance, which is unattainable by previous methods. Project Page: https://sk-adapter.github.io/

Paper Structure

This paper contains 31 sections, 11 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: With its lightweight and effective design, SK-Adapter efficiently generates 3D assets in the native 3D domain from given skeletons. It also supports skeleton adaptation (keeping the skeleton condition while changing the text prompt) and flexible skeleton-based editing. Please refer to our supplementary materials for more examples and demos on editing and skinned/rigged animation.
  • Figure 2: Overview of the SK-Adapter framework. The GRPE module encodes the 3D skeleton's joints and topology into sparse tokens. These tokens are injected into a frozen pre-trained backbone via trainable cross-attention layers.
  • Figure 3: Qualitative comparison of our SK-Adapter with the baselines. For the baseline generations, arrows indicate structural inconsistency with the given skeleton. Short captions indicate one of the apparent problems in their generated results.
  • Figure 4: SK-Adapter enables flexible skeleton-guided editing, including addition and re-posing.
  • Figure 5: Comparison of different architecture designs. Without Cross-Attention, the model tends to collapse under complex skeleton conditions. The Topological Encoder, in turn, helps the model better comprehend skeleton structures, as indicated by the bounding boxes.
  • ...and 4 more figures
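
The mechanism summarized in the abstract and in the Figure 2 caption (skeleton tokens injected into a frozen backbone via cross-attention) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: `encode_skeleton` is a hypothetical stand-in for the GRPE module, `make_matrix` produces toy fixed weights in place of trained ones, and the scalar `gate` on the residual branch is an assumed device to show how the frozen features can be left untouched when the adapter contributes nothing.

```python
import math

def encode_skeleton(joints, parents):
    """Hypothetical stand-in for the skeleton encoder (not the paper's GRPE).
    One token per joint: its position plus the offset from its parent
    (zero for the root), so coordinates and topology are both represented."""
    tokens = []
    for (x, y, z), p in zip(joints, parents):
        px, py, pz = (x, y, z) if p < 0 else joints[p]
        tokens.append([x, y, z, x - px, y - py, z - pz])
    return tokens

def make_matrix(rows, cols):
    """Toy deterministic weights; a real adapter would learn these."""
    return [[0.1 * ((i + j) % 3 - 1) for j in range(cols)] for i in range(rows)]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latents, tokens, Wq, Wk, Wv, gate):
    """Residual cross-attention: backbone latents are the queries,
    skeleton tokens supply keys and values. `gate` (assumed) scales the
    adapter branch; gate = 0 returns the backbone features unchanged."""
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    out = []
    for h in latents:
        q = matvec(Wq, h)
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        attended = [sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(h))]
        out.append([a + gate * b for a, b in zip(h, attended)])
    return out

# Toy skeleton: root -> joint 1 -> joint 2 (parent index -1 marks the root).
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
parents = [-1, 0, 1]
tokens = encode_skeleton(joints, parents)

# Two 4-dimensional latents standing in for frozen backbone features.
latents = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]]
Wq, Wk, Wv = make_matrix(4, 4), make_matrix(4, 6), make_matrix(4, 6)

unchanged = cross_attention(latents, tokens, Wq, Wk, Wv, gate=0.0)
conditioned = cross_attention(latents, tokens, Wq, Wk, Wv, gate=1.0)
```

In this sketch only `Wq`, `Wk`, `Wv`, and `gate` would be trainable, which mirrors the abstract's description of a lightweight adapter around a frozen backbone; the actual layer layout and token encoding in SK-Adapter may differ.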