ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

Hongbin Xu, Weitao Chen, Zhipeng Zhou, Feng Xiao, Baigui Sun, Mike Zheng Shou, Wenxiong Kang

TL;DR

ControLRM is an end-to-end feed-forward model for rapid and controllable 3D generation built on a large reconstruction model (LRM). Experiments demonstrate its strong generalization capacity.

Abstract

Despite recent advancements in 3D generation methods, achieving controllability remains a challenging issue. Current approaches that rely on score-distillation sampling are hindered by laborious, time-consuming procedures. Furthermore, the process of first generating 2D representations and then mapping them to 3D lacks internal alignment between the two forms of representation. To address these challenges, we introduce ControLRM, an end-to-end feed-forward model designed for rapid and controllable 3D generation using a large reconstruction model (LRM). ControLRM comprises a 2D condition generator, a condition encoding transformer, and a triplane decoder transformer. Instead of training our model from scratch, we advocate for a joint training framework. In the condition training branch, we lock the triplane decoder and reuse the deep and robust encoding layers of LRM, which were pretrained on millions of 3D data samples. In the image training branch, we unlock the triplane decoder to establish an implicit alignment between the 2D and 3D representations. To ensure unbiased evaluation, we curate evaluation samples from three distinct datasets (G-OBJ, GSO, ABO) rather than relying on cherry-picked, manually generated examples. Comprehensive quantitative and qualitative comparisons of 3D controllability and generation quality demonstrate the strong generalization capacity of our proposed approach.
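
The lock/unlock behaviour of the two training branches can be illustrated with a minimal sketch. It is not the authors' implementation: the class `ControLRMSketch`, the `Module` stand-in, and the `configure_branch` method are hypothetical names introduced here, and only the three component names and the decoder's locked/unlocked state come from the abstract. The abstract does not say which other components are updated in each branch, so the sketch simply assumes they stay trainable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """Toy stand-in for a network component; it only tracks a trainable flag."""
    name: str
    trainable: bool = True


@dataclass
class ControLRMSketch:
    # Component names follow the abstract; the structure itself is hypothetical.
    condition_generator: Module = field(
        default_factory=lambda: Module("2d_condition_generator"))
    condition_encoder: Module = field(
        default_factory=lambda: Module("condition_encoding_transformer"))
    triplane_decoder: Module = field(
        default_factory=lambda: Module("triplane_decoder_transformer"))

    def configure_branch(self, branch: str) -> None:
        """Lock the triplane decoder in the condition branch, unlock it in the image branch."""
        if branch == "condition":
            self.triplane_decoder.trainable = False
        elif branch == "image":
            self.triplane_decoder.trainable = True
        else:
            raise ValueError(f"unknown branch: {branch!r}")

    def trainable_names(self) -> List[str]:
        """Return the names of the components that would receive updates."""
        modules = (self.condition_generator, self.condition_encoder, self.triplane_decoder)
        return [m.name for m in modules if m.trainable]


model = ControLRMSketch()
model.configure_branch("condition")
condition_trainable = model.trainable_names()  # decoder is excluded here
model.configure_branch("image")
image_trainable = model.trainable_names()      # decoder is included here
```

In a real framework the flag would typically map to freezing the decoder's parameters so that the optimizer skips them in the condition branch; how the two branches are scheduled or weighted is not stated in the abstract and is left out of the sketch.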

Paper Structure

This paper contains 35 sections, 26 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Performance and efficiency comparison among different conditional 3D generation methods. Fig. (a) shows the average inference time of each method on a single V100-32G GPU. Our ControLRM-T and ControLRM-D are respectively 60 and 18 times faster at inference than the fastest baseline, MVControl li2024controllable. Fig. (b) shows the results of 15 evaluation metrics on the G-Objaverse test set, including 3D controllability metrics (introduced in Sec. \ref{exp:controllability:metrics}) and controllable 3D generation metrics (introduced in Sec. \ref{exp:generation:metrics}).
  • Figure 2: The overall framework of ControLRM, a feed-forward controllable 3D generation model.
  • Figure 3: The architecture of the 2D conditional generator in ControLRM. (a) shows the transformer-based generator in ControLRM-T, and (b) shows the diffusion-based generator in ControLRM-D.
  • Figure 4: Visualization comparison of controllability under different conditional controls (Edge/Depth/Normal/Sketch).
  • Figure 5: Qualitative comparison with SOTA 3D generation methods, including MVControl li2024controllable, DreamGaussian tang2023dreamgaussian, and VolumeDiffusion tang2023volumediffusion. To avoid cherry-picking, the input conditions are extracted from G-OBJ, GSO, and ABO datasets. None of the images are observed by our model during training. Please zoom in for clearer visualization.
  • ...and 2 more figures