Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models

Suttisak Wizadwongsa, Jinfan Zhou, Edward Li, Jeong Joon Park

TL;DR

This work demonstrates that pre-trained feed-forward 3D reconstruction models can serve as effective latent encoders for training scalable 3D generative models. By standardizing reconstruction latents, applying spatial importance weighting, and adding a perceptual rendering loss, the authors enable efficient training of a linear-scaling, multi-stream transformer (TriFlow) for text-conditioned 3D generation. The approach achieves competitive or state-of-the-art performance in text-to-3D tasks on Objaverse/ShapeNet, outperforming several baselines while avoiding expensive encoder training. This provides a practical pathway to high-quality, scalable 3D content creation by reusing existing reconstruction models rather than building dataset-specific encoders from scratch.

Abstract

Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.
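The post-processing described in the abstract (standardizing the latent features, then spatially weighting the flow-matching loss) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the per-channel statistics, the `(N, C, H, W)` layout, the normalization by the weight sum, the interpolation convention `x_t = (1 - t) * x0 + t * x1`, and the helper names `standardize` and `weighted_rf_loss` are all hypothetical choices made for this example.

```python
import numpy as np

def standardize(latents, eps=1e-6):
    """Standardize latents per channel (assumed layout: N, C, H, W).

    Returns the standardized latents plus the statistics needed to
    undo the transform before decoding.
    """
    mean = latents.mean(axis=(0, 2, 3), keepdims=True)
    std = latents.std(axis=(0, 2, 3), keepdims=True)
    return (latents - mean) / (std + eps), mean, std

def weighted_rf_loss(velocity_fn, x1, weights, rng):
    """Spatially weighted rectified-flow loss (illustrative only).

    velocity_fn: callable (x_t, t) -> predicted velocity, same shape as x_t
    x1:          standardized data latents, shape (N, C, H, W)
    weights:     non-negative spatial weights, shape (N, 1, H, W)
    """
    x0 = rng.standard_normal(x1.shape)            # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1, 1, 1))  # one time per sample
    xt = (1.0 - t) * x0 + t * x1                  # straight-line interpolation
    target = x1 - x0                              # constant velocity along the line
    sq_err = (velocity_fn(xt, t) - target) ** 2
    w = np.broadcast_to(weights, sq_err.shape)
    # Normalize by the total weight so the loss scale does not depend on the mask.
    return float((w * sq_err).sum() / w.sum())
```

In such a setup, down-weighting empty regions keeps noisy features there from dominating the objective, while the stored `mean` and `std` let generated samples be mapped back to the reconstruction model's original latent scale before rendering.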

Paper Structure

This paper contains 27 sections, 5 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Text-to-3D generation. Our TriFlow model is trained on triplanes from a pretrained feed-forward reconstruction model [xu2024instantmesh] and can generate a high-quality 3D model in a few seconds. Left column: samples from a model trained on Objaverse [deitke2024objaverse] LVIS. Right: samples from models fine-tuned on ShapeNet [chang2015shapenet] chairs and cars.
  • Figure 2: Overview of our image-to-3D generation pipeline and architecture. Our framework includes two main components: (1) a dataset preparation stage, where single-view or multi-view images are processed through a feed-forward image-to-triplane model to generate triplanes, and (2) TriFlow, a text-conditioned generative model trained on these triplanes using rectified-flow-based loss ($L_{RF}$) and perceptual loss ($L_{lpips}$) compared against the original images. On the right, we show our model architecture, a multi-stream transformer incorporating a combination of MM-DiT and DiT blocks. Note that $\otimes$ denotes concatenation between token streams.
  • Figure 3: Visualization of triplane features produced by InstantMesh [xu2024instantmesh] and the mask used for the weighted loss during training. Note the severe noise of the features in the empty spaces.
  • Figure 4: Text-conditional generation on ShapeNet. While LN3Diff [lan2025ln3diff], a leading text-to-3D method, shows poor visual results when rendered from the side or bottom views, our results are of higher quality with fewer artifacts and adhere better to input prompts unseen during training. Zoom in for the best view.
  • Figure 5: Some ShapeNet objects cause artifacts when rendered without textures following existing works (a), which damages the geometry inference of feed-forward reconstruction methods (b,c).
  • ...and 3 more figures
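
The two training losses named in the Figure 2 caption, $L_{RF}$ and $L_{lpips}$, can be written out in one plausible form. This is a sketch under assumptions rather than the paper's own equations: the interpolation convention, the weight map $w$ (standing in for the mask of Figure 3), the renderer $\mathcal{R}$, the one-step clean estimate $\hat{z}$, and the balancing coefficient $\lambda$ are all introduced here for illustration and are not specified in the text above.

```latex
% Assumed convention: z is a standardized triplane latent, \epsilon is Gaussian noise,
% c is the text condition, and p indexes spatial positions.
x_t = (1 - t)\,\epsilon + t\,z, \qquad t \sim \mathcal{U}[0, 1]

L_{RF} = \mathbb{E}_{z,\epsilon,t}\!\left[
  \frac{\sum_{p} w(p)\,\bigl\| v_\theta(x_t, t, c)_p - (z - \epsilon)_p \bigr\|^2}
       {\sum_{p} w(p)}
\right]

% One-step estimate of the clean latent, rendered and compared with the original image I.
\hat{z} = x_t + (1 - t)\, v_\theta(x_t, t, c), \qquad
L = L_{RF} + \lambda\, L_{lpips}\bigl(\mathcal{R}(\hat{z}),\, I\bigr)
```

Under this convention, a perfect velocity prediction $v_\theta = z - \epsilon$ gives $\hat{z} = z$ exactly, so the perceptual term then compares a rendering of the true latent with the original image.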