Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang

TL;DR

Stroke3D tackles the challenge of creating animatable 3D assets from intuitive inputs by proposing a skeleton-first, two-stage generation pipeline. It combines Sk-VAE for skeleton graph encoding with Sk-DiT, a diffusion model conditioned on 2D strokes and text, followed by mesh synthesis that uses the TextuRig dataset and SKA-DPO to maximize skeleton-mesh alignment. The authors report that the approach yields plausible 3D skeletons with low Chamfer Distance and high SKA scores, outperforming baselines on MagicArticulate and SKDream, and they demonstrate robustness to input noise and cross-view generation. This work promises a more accessible, editable path to ready-to-animate 3D characters and assets, enabling broader adoption in AR/VR and film pipelines.
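As a rough illustration of the Chamfer Distance mentioned above, the sketch below computes a symmetric, mean-based variant between two sets of 3D joint positions. The function name, the choice of unsquared Euclidean distance, and the averaging convention are assumptions made for this example; the paper may use a different variant (for instance squared distances or a different normalization).

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty 3D point sets.

    Assumed convention (not taken from the paper): unsquared Euclidean
    distance, averaged per direction, then summed over both directions.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Average distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)


# Hypothetical joint positions, for illustration only.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

A lower value means the predicted joints lie closer to the reference joints; identical point sets give zero.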

Abstract

Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods struggle to generate animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates generation into: 1) Controllable Skeleton Generation, where we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton's graph structure into a latent space, in which the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE's decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig, a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, these components enable a more intuitive workflow for creating ready-to-animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.

Paper Structure (29 sections, 5 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: We present Stroke3D (§ overview), a novel framework that generates rigged 3D meshes from user-drawn strokes and language instructions. Crucially, all examples shown are generated from real human inputs via our provided canvas tool and skinned with Blender's automatic skinning tools. We show versatile downstream applications, including generation from different viewpoints, structural editing by adding strokes or modifying joint positions, and final animation. Skeleton color represents the depth in 3D space.
  • Figure 2: Skeleton-Caption Pipeline (§ data_preparation). We render the skeleton and its corresponding mesh together into orthogonal projections using pyrender or Blender. This provides the visual context needed to represent the object's identity and pose clearly. These views are then fed into a Vision-Language Model (VLM) to generate detailed descriptions of the object's identity and pose.
  • Figure 3: Overview of Stroke3D (§ overview). During the training phase, Sk-VAE encodes a skeleton graph into a latent space. Subsequently, Sk-DiT is trained to generate these latent embeddings, conditioned on the corresponding 2D strokes and text prompt. After training with TextuRig, we leverage SKA-DPO to further refine SKDream with a skeleton-mesh alignment reward signal. The right side illustrates the implementation details of our models.
  • Figure 4: Qualitative comparison of skeleton generation (§ comparison). Unlike existing rigging methods that take 3D meshes as input, our approach takes 2D projections of a 3D skeleton. It produces plausible skeletons that adhere more faithfully to the ground truth.
  • Figure 5: Qualitative comparison of skeleton-conditioned multi-view generation (§ comparison). Our method produces higher-quality views that adhere more faithfully to the input skeleton. For simplicity, two of the four generated views are shown.
  • ...and 6 more figures
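
The captions above refer to orthogonal projections of a skeleton and to 2D projections of a 3D skeleton used as input. A minimal sketch of that idea, assuming a skeleton stored as a joint list plus parent-child bone index pairs, is to drop one coordinate axis and emit one 2D line segment per bone. The function name `project_skeleton`, the data layout, and the axis convention are hypothetical; the paper's own rendering relies on pyrender or Blender.

```python
def project_skeleton(joints, bones, drop_axis=2):
    """Orthographically project a 3D skeleton to 2D stroke segments.

    joints: list of (x, y, z) tuples.
    bones: list of (parent_index, child_index) pairs.
    drop_axis: coordinate discarded by the projection (2 = look along z).
    All of these conventions are assumptions made for this sketch.
    """
    if drop_axis not in (0, 1, 2):
        raise ValueError("drop_axis must be 0, 1, or 2")
    u, v = [axis for axis in range(3) if axis != drop_axis]
    # Keep the two remaining coordinates of every joint.
    points_2d = [(joint[u], joint[v]) for joint in joints]
    # One straight 2D segment per bone, from parent joint to child joint.
    return [(points_2d[parent], points_2d[child]) for parent, child in bones]


# Hypothetical three-joint chain, for illustration only.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.5), (1.0, 1.0, -0.5)]
bones = [(0, 1), (1, 2)]
front_view = project_skeleton(joints, bones, drop_axis=2)
side_view = project_skeleton(joints, bones, drop_axis=0)
print(front_view)
print(side_view)
```

The discarded coordinate is exactly the depth information that a stroke-conditioned generator has to recover, which is why the same 3D skeleton yields different stroke sets from different viewpoints.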