Instructive3D: Editing Large Reconstruction Models with Text Instructions

Kunal Kathare, Ankit Dhiman, K Vikas Gowda, Siddharth Aravindan, Shubham Monga, Basavaraja Shanthappa Vandrotti, Lokesh R Boregowda

TL;DR

Instructive3D addresses the lack of fine-grained editing in large reconstruction models by introducing a text-conditioned triplane diffusion adapter that edits 3D objects within the triplane latent space of a pre-trained LRM. The method couples a Tri-VAE to compress triplane features and a latent diffusion model (LTriD) conditioned on text prompts, trained with identity and edited data generated via InstructPix2Pix. This data-efficient, two-phase training framework preserves geometry while enabling expressive texture and style edits, and demonstrates qualitative and quantitative improvements over strong baselines on Objaverse LVIS. The approach promises practical impact for AR/VR, animation, and game design by enabling precise, natural-language-driven edits without per-instance 3D optimization.

Abstract

Transformer-based methods have enabled users to create, modify, and comprehend text and image data. Recently proposed Large Reconstruction Models (LRMs) extend this capability by generating high-quality 3D models from a single object image. These models, however, cannot manipulate or edit finer details, such as adding standard design patterns or changing the color and reflectance of the generated objects, and therefore lack the fine-grained control that can be valuable in domains such as augmented reality, animation, and gaming. Naively training LRMs for this purpose would require generating precisely edited image and 3D object pairs, which is computationally expensive. In this paper, we propose Instructive3D, a novel LRM-based model that integrates the generation and fine-grained editing of 3D objects, driven by user text prompts, into a single model. We accomplish this by adding an adapter that performs a diffusion process, conditioned on a text prompt specifying the edits, in the triplane latent space representation of 3D object models. Our method does not require the generation of edited 3D objects. Additionally, Instructive3D performs geometrically consistent modifications, because the edits specified by user-defined text prompts are applied to the triplane latent representation, which enhances the versatility and precision of the generated 3D objects. We compare the objects generated by Instructive3D against a baseline that first generates 3D object meshes with a standard LRM and then edits them using text prompts, with input images drawn from the Objaverse LVIS dataset. We find that Instructive3D produces qualitatively superior 3D objects with the properties specified by the edit prompts.

Paper Structure (19 sections, 4 equations, 33 figures, 2 tables)

This paper contains 19 sections, 4 equations, 33 figures, 2 tables.

Figures (33)

  • Figure 1: An overview of Instructive3D. The top section illustrates the limitation of existing large reconstruction models (LRMs), which lack fine-grained control over generated 3D objects. In contrast, the bottom section presents examples of how Instructive3D enables fine-grained control over 3D models using text-based prompts, showcasing the enhanced versatility and control offered by our approach.
  • Figure 2: The architecture of our adapter, Instructive3D. The triplane is first generated by the LRM (in this case Real3D); each plane of the triplane is then separated, normalized to [-1, 1], and processed through its dedicated encoder, trained specifically for the corresponding plane. The resulting latent planes have their channels concatenated and are passed through a conditional UNet (Ronneberger et al., 2015) for denoising, in conjunction with a text embedding obtained from a CLIP (Radford et al., 2021) transformer based on the input text prompt. The denoised output is then separated back into the three planes, which are passed through their respective decoders. Finally, the planes are stacked together to form the conditioned triplane, reflecting the user-specified text-based modifications.
  • Figure 3: Comparison of meshes generated by different models. The first row displays the meshes generated by the Real3D LRM, illustrating its base performance. The second row shows the results from a triplane VAE-based approach, in which a separate VAE was trained for each of the three planes of the triplane. The third row presents the meshes produced by the UNet model, which was initially trained with null conditioning. This comparison highlights the varying degrees of control and detail achieved by each method.
  • Figure 4: Comparison of text-conditioned 3D generation between Instructive3D (ours) and baselines. The first row lists the input text prompts. The second row displays the input images, the third row shows results for Instructive3D, and subsequent rows show results from the baselines. This comparison highlights the effectiveness of Instructive3D in providing fine-grained control from the input text descriptions. Notice how the baselines fail to produce the edit for the text prompt "add a velvet texture to bag" (last column).
  • Figure 5: Comparison of 2D and 3D VAEs. The first row shows the output generated by Real3D, and the next two rows show the outputs generated by the 2D VAE and the 3D VAE, respectively.
  • ...and 28 more figures
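
The data flow described in the Figure 2 caption can be sketched as a shape-level walk-through. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-plane encoders, decoders, and the conditional UNet are replaced by hypothetical toy stand-ins (`toy_encode`, `toy_decode`, `toy_denoise`), the min-max normalization and the final de-normalization step are assumptions (the caption only states that planes are normalized to [-1, 1]), and all tensor sizes are arbitrary.

```python
import numpy as np

def normalize_plane(plane):
    """Min-max scale one plane to [-1, 1] (assumed scheme); also return its range."""
    lo, hi = float(plane.min()), float(plane.max())
    scale = hi - lo if hi > lo else 1.0
    return 2.0 * (plane - lo) / scale - 1.0, (lo, hi)

def denormalize_plane(plane, value_range):
    """Map a plane from [-1, 1] back to its original range (assumed step)."""
    lo, hi = value_range
    return (plane + 1.0) / 2.0 * (hi - lo) + lo

def toy_encode(plane):
    """Hypothetical stand-in for a per-plane VAE encoder: 2x2 average pooling."""
    c, h, w = plane.shape
    return plane.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def toy_decode(latent):
    """Hypothetical stand-in for a per-plane VAE decoder: 2x nearest upsampling."""
    return latent.repeat(2, axis=1).repeat(2, axis=2)

def toy_denoise(latent, text_embedding):
    """Hypothetical stand-in for the text-conditioned UNet: identity mapping."""
    return latent

def edit_triplane(triplane, text_embedding):
    """triplane has shape (3, C, H, W); returns an array of the same shape."""
    normalized, ranges = zip(*(normalize_plane(p) for p in triplane))
    latents = [toy_encode(p) for p in normalized]      # one encoder per plane
    stacked = np.concatenate(latents, axis=0)          # channel concat: (3*C, H/2, W/2)
    denoised = toy_denoise(stacked, text_embedding)    # text-conditioned denoising
    per_plane = np.split(denoised, 3, axis=0)          # back to three latent planes
    decoded = [toy_decode(z) for z in per_plane]       # one decoder per plane
    restored = [denormalize_plane(p, r) for p, r in zip(decoded, ranges)]
    return np.stack(restored, axis=0)                  # conditioned triplane

triplane = np.random.default_rng(0).normal(size=(3, 4, 8, 8))
text_embedding = np.zeros(16)  # placeholder for a CLIP text embedding
edited = edit_triplane(triplane, text_embedding)
print(edited.shape)
```

In this sketch the three planes keep separate encoders and decoders while sharing a single denoiser over the concatenated channels, which mirrors the split, encode, concatenate, denoise, split, decode, stack order given in the caption.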