Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text

Junshu Tang; Yanhong Zeng; Ke Fan; Xuheng Wang; Bo Dai; Kai Chen; Lizhuang Ma

TL;DR

This paper tackles automatic, high-fidelity texture design for 3D biped cartoon characters from text by introducing Make-It-Vivid, a UV-space texture generator that leverages a topology-aware UV representation and priors from pretrained text-to-image diffusion models. It builds a text–UVMap dataset via a multi-agent captioning pipeline and fine-tunes a diffusion model with a low-rank adapter, supplemented by adversarial, depth-guided refinement to recover fine details and reduce seams. The approach outperforms existing texture methods on 3DBiCar in terms of alignment to prompts and perceptual quality, while enabling rapid, multi-style texture generation and support for prompt-based editing and stylization. These capabilities promise efficient, scalable 3D character production and animation, with potential for out-of-domain generation and stylized storytelling applications.

Abstract

Creating and animating 3D biped cartoon characters is crucial and valuable in various applications. Compared with geometry, diverse texture design plays an important role in making 3D biped cartoon characters vivid and charming. Therefore, we focus on automatic texture design for cartoon characters based on input instructions. This is challenging because of domain-specific requirements and a lack of high-quality data. To address this challenge, we propose Make-It-Vivid, the first attempt to enable high-quality texture generation from text in UV space. We prepare detailed text-texture paired data for 3D characters by using vision-question-answering agents. Then we customize a pretrained text-to-image model to generate texture maps with the template structure while preserving natural 2D image knowledge. Furthermore, to enhance fine-grained details, we propose a novel adversarial learning scheme to narrow the domain gap between the original dataset and the realistic texture domain. Extensive experiments show that our approach outperforms current texture generation methods, resulting in efficient character texturing and generation that is faithful to the prompts. In addition, we showcase various applications such as out-of-domain generation and texture stylization. We also provide an efficient generation system for automatic text-guided textured character generation and animation.
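
The TL;DR mentions fine-tuning the diffusion model with a low-rank adapter. As a rough illustration of that general idea only, the sketch below adds a low-rank update B·A to a frozen weight matrix W0 in plain Python. The function names, the zero initialization of B, the 0.01 standard deviation, and the `scale` factor are assumptions made for this example; they are not taken from the paper, which applies the adapter inside a pretrained text-to-image diffusion model.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out, d_in, rank, seed=0):
    """Create adapter factors B (d_out x rank) and A (rank x d_in).

    Assumption: B starts at zero so the adapted weight initially equals W0.
    """
    rng = random.Random(seed)
    B = [[0.0] * rank for _ in range(d_out)]
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    return B, A


def effective_weight(W0, B, A, scale=1.0):
    """Return W0 + scale * (B @ A); W0 itself is left untouched (frozen)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]


def param_counts(d_out, d_in, rank):
    """Return (full weight parameters, adapter parameters)."""
    return d_out * d_in, rank * (d_out + d_in)
```

For a hypothetical 320 x 320 layer with rank 4, `param_counts(320, 320, 4)` gives 102400 full parameters against 2560 adapter parameters, which is why only the adapter needs to be trained and stored.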

Paper Structure (19 sections, 3 equations, 12 figures, 4 tables)

Introduction
Related Work
Preliminary
Text-guided UV Texture Generation
Multi-agent Character Captioning
Enhanced UV Texture Generation from Text
Experiments
Dataset
Implementation Details
Comparison with State-of-the-art Approach
Ablation Studies
Applications
Conclusion
Additional Implementation Details
Additional Experiments
...and 4 more sections

Figures (12)

Figure 1: We present Make-It-Vivid, the first attempt to create plausible and consistent texture in UV space for 3D biped cartoon characters from text input within a few seconds. Make-It-Vivid enables texture generation with fine-grained details in multiple styles (see above), and also supports efficient text-guided production of animatable textured characters (see bottom).
Figure 2: Multi-round dialogue for captioning 3D characters. For each rendered image, we hard-code three questions, use ChatGPT to ask three follow-up questions, and then summarize the dialogue.
Figure 3: Overall framework for training the texture generator. Our method takes as input a data sample consisting of a texture map $T$, the corresponding text description $P$, and a mesh model $\mathcal{M}$. We finetune the low-rank adaptor $\Delta\theta$ of a pretrained text-to-image diffusion model to generate high-quality UV texture. To improve the quality and perceptual fidelity of synthetic textures, we introduce adversarial training to enhance the texture details. As a proxy to guide this adversarial training, we use plausible synthetic images $I_v$ generated by ControlNet $\mathcal{C}$ conditioned on the rendered depth $I^d_v$.
Figure 4: Texture samples for the prompt "wearing overall": (a) from the 3DBiCar dataset [luo2023rabit]; (b) generated by the texture generator from Rabit [luo2023rabit]; (c) generated by the simple finetuned LDM; (d) generated by the enhanced finetuned LDM.
Figure 5: Qualitative comparison on the test prompt set with state-of-the-art shape texturing approaches: TEXTure [texture], Text2Tex [text2tex], LatentPaint [latent-nerf], Fantasia3D [fantasia3d], and Rabit [luo2023rabit]. Our results show high-quality, consistent texture that is faithful to the input prompt.
...and 7 more figures
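
The multi-round captioning dialogue described in the Figure 2 caption (three hard-coded questions, three follow-up questions, then a summary) can be sketched as a small loop. This is only an illustration under stated assumptions: the three question strings are placeholders because the page does not list the actual ones, and `answer`, `ask_followup`, and `summarize` are hypothetical callables standing in for the vision-question-answering agent and ChatGPT.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]

# Placeholder questions: the actual hard-coded questions are not given here.
FIXED_QUESTIONS = [
    "What kind of character is this?",
    "What is the character wearing?",
    "What colors and patterns appear on the character?",
]


def caption_character(
    image: object,
    answer: Callable[[object, str], str],      # hypothetical VQA agent
    ask_followup: Callable[[Dialogue], str],   # hypothetical question-asking agent
    summarize: Callable[[Dialogue], str],      # hypothetical summarizing agent
    n_followups: int = 3,
) -> Tuple[str, Dialogue]:
    """Run fixed questions, then follow-up questions, then summarize."""
    dialogue: Dialogue = []
    # Round 1: the hard-coded questions, answered from the rendered image.
    for question in FIXED_QUESTIONS:
        dialogue.append((question, answer(image, question)))
    # Round 2: follow-up questions proposed from the dialogue so far.
    for _ in range(n_followups):
        question = ask_followup(dialogue)
        dialogue.append((question, answer(image, question)))
    # Final step: condense all question-answer pairs into one caption.
    return summarize(dialogue), dialogue
```

Passing the agents in as callables keeps the loop independent of any particular model or API, so stub functions are enough to exercise it.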
