
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images

Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing, Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang, Hengshuang Zhao

TL;DR

Tailor3D tackles fine-grained 3D editing by pairing an editable front-view image with a back view generated by multi-view diffusion, then using a Dual-sided LRM to fuse the front and back triplane features into a coherent 3D object. The approach relies on a LoRA-enhanced Triplane Transformer and Viewpoint Cross-Attention to handle imperfect front/back consistency, enabling rapid, interactive editing (seconds per step) and high-quality reconstructions. Experiments on Gobjaverse-LVIS demonstrate 3D generative fill and texture/style transfer, show advantages over existing image-to-3D methods, and include ablations that validate the design choices. The work targets animation, game design, and rapid prototyping, where dual-view guidance makes precise 3D asset customization more accessible.

Abstract

Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed editing and customization of 3D assets remains a long-standing challenge. Specifically, 3D generation methods cannot follow finely detailed instructions as precisely as their 2D image creation counterparts. Imagine obtaining a toy through 3D AIGC, only to find it has undesired accessories and clothing. To tackle this challenge, we propose a novel pipeline called Tailor3D, which swiftly creates customized 3D assets from editable dual-side images. We aim to emulate a tailor's ability to locally change objects or perform overall style transfer. Unlike creating 3D assets from multiple views, using dual-side images eliminates the conflicts on overlapping areas that occur when editing individual views. The pipeline begins by editing the front view, then generates the back view of the object through multi-view diffusion, and afterward edits the back view. Finally, a Dual-sided LRM is proposed to seamlessly stitch together the front and back 3D features, akin to a tailor sewing together the front and back of a garment. The Dual-sided LRM compensates for imperfect consistency between the front and back views, enhancing editing capabilities and reducing memory burdens while seamlessly integrating the two views into a unified 3D representation with the LoRA Triplane Transformer. Experimental results demonstrate Tailor3D's effectiveness across various 3D generation and editing tasks, including 3D generative fill and style transfer. It provides a user-friendly, efficient solution for editing 3D assets, with each editing step taking only seconds to complete.
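
To make the "stitching" step more concrete, the following is a minimal NumPy sketch of cross-attention between two sets of feature tokens, the general mechanism that the Viewpoint Cross-Attention named in the TL;DR builds on. It is an illustration under stated assumptions, not the paper's implementation: the token shapes, the single attention head, the residual connection, the front-queries-back direction, and the names `cross_attention`, `front_tokens`, and `back_tokens` are all hypothetical, and the rotation of the back triplane mentioned in Figure 2 is omitted.

```python
import numpy as np


def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Let each query token gather information from the context tokens.

    query_tokens:   (n_query, dim), e.g. flattened front-view features (assumed layout)
    context_tokens: (n_context, dim), e.g. flattened back-view features (assumed layout)
    Returns the updated query tokens and the attention weights.
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual connection: the front features are refined, not replaced.
    return query_tokens + weights @ v, weights


rng = np.random.default_rng(0)
dim = 16
front_tokens = rng.normal(size=(12, dim))  # hypothetical front-view tokens
back_tokens = rng.normal(size=(12, dim))   # hypothetical back-view tokens
w_q, w_k, w_v = (0.1 * rng.normal(size=(dim, dim)) for _ in range(3))

fused_tokens, attention = cross_attention(front_tokens, back_tokens, w_q, w_k, w_v)
```

Because every front token attends over all back tokens with learned weights, this kind of fusion does not require the two views to agree pixel for pixel, which is one plausible reading of how imperfect front/back consistency is tolerated.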

Paper Structure (26 sections, 5 equations, 12 figures, 4 tables)

This paper contains 26 sections, 5 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Results and Pipeline. We show our method for 3D style customization, as well as geometry and texture editing. Our pipeline involves editing images and generating the 3D object using Dual-sided LRM, with each step completed in just 5 seconds, allowing for rapid 3D object customization.
  • Figure 2: Model Architecture of Dual-sided LRM. We start with front and back view images. Then, using LoRA Triplane Transformer, we obtain front and back triplanes. Finally, we ‘tailor’ the two triplane features through rotation and Viewpoint Cross-Attention to obtain the 3D object.
  • Figure 3: LoRA Triplane Transformer. (a) For Cross-Attention, we use the LoRA structure to replace the connection layers of $qkv$ and $output$. (b) For Self-Attention, we replace the connection layers of $input$ and $output$. Details of the LoRA are shown in (c).
  • Figure 4: 3D Generative Fill and 3D Style Transfer. Generative fill includes both Geometry Fill and Pattern Fill, allowing us to add or modify local geometric structures or texture patterns of 3D objects. Guidance can be provided through text or images as prompts. Additionally, style images or textual guidance can be used to transform 3D objects into desired styles. Preserving the identity of the original IP while restyling it adds significant practical value to 3D tasks.
  • Figure 5: Comparison with Existing 3D Generation Methods. We compare against single image-to-3D methods. Wonder3D and TriplaneGaussian produce lower-resolution results, while LGM often shows ghosting artifacts on complex textures. Our method achieves better results in these comparisons.
  • ...and 7 more figures
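
The Figure 3 caption describes replacing the projection layers of the attention blocks with LoRA layers. As a rough illustration of what such a replacement looks like, here is a minimal NumPy sketch of a generic LoRA linear layer: a frozen weight plus a trainable low-rank update. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialisation are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable up-projection, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, in_dim) -> (tokens, out_dim)
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
pretrained = rng.normal(size=(64, 64))   # stand-in for one attention projection
layer = LoRALinear(pretrained, rank=4)
tokens = rng.normal(size=(10, 64))
output = layer(tokens)
```

Since `B` starts at zero, the layer initially reproduces the frozen projection exactly, and fine-tuning only touches `A` and `B` (512 values here versus 4096 in the full weight). That parameter saving is consistent with the abstract's remark about reduced memory burden, though the paper's actual ranks and layer sizes are not given in this summary.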