CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction

Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji

TL;DR

CDI3D targets high-quality 3D reconstruction from a single image by decoupling main-view generation from dense-view synthesis. A 2D diffusion model first produces $N=4$ main views; a Dense View Interpolation (DVI) module, guided by a tilt camera trajectory, then synthesizes interpolated views between them, and the expanded view set is fed into a tri-plane-based Large Reconstruction Model (LRM) that generates a textured mesh. Compared with video-diffusion baselines, the approach reports better geometry and texture fidelity, improved multi-view consistency, and faster novel-view synthesis, with validation on datasets including GSO and Objaverse. The result is efficient, high-quality 3D content creation applicable to AR/VR, robotics, and entertainment; feature-level view super-resolution is highlighted as an avenue for further improvement. The key ideas are two-stage view generation with densification via DVI, elevation-aware camera trajectories, and a cross-modal fusion that maps multi-view tokens to a fixed tri-plane representation for mesh reconstruction. The number of main views $N$ and the interpolation details are integral to the method’s performance, as reflected in the reported improvements across geometry and texture metrics.

Abstract

3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.
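
As a rough illustration of the view-densification idea described above, the Python sketch below lays out a camera schedule in which a few main views are evenly spaced in azimuth and extra views are placed between each neighboring pair, with the elevation tilting up between main views. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `dense_view_schedule` and `camera_position`, the number of interpolated views per pair (`num_between`), the peak tilt (`tilt_deg`), and the sinusoidal elevation profile. It only computes camera angles; it does not perform diffusion-based interpolation or mesh reconstruction.

```python
import math


def dense_view_schedule(num_main=4, num_between=3, tilt_deg=30.0):
    """Return a list of (azimuth_deg, elevation_deg, is_main) tuples.

    Hypothetical schedule: main views sit at elevation 0 and are evenly
    spaced in azimuth; interpolated views between two main views follow
    an assumed sinusoidal tilt that peaks halfway between them.
    """
    views = []
    step = 360.0 / num_main  # azimuth gap between neighboring main views
    for i in range(num_main):
        start_az = i * step
        views.append((start_az, 0.0, True))  # main view
        for k in range(1, num_between + 1):
            t = k / (num_between + 1)  # fraction of the way to the next main view
            azimuth = start_az + t * step
            elevation = tilt_deg * math.sin(math.pi * t)  # assumed tilt profile
            views.append((azimuth, elevation, False))  # interpolated view
    return views


def camera_position(azimuth_deg, elevation_deg, radius=1.0):
    """Convert spherical camera angles to a Cartesian position (z is up)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)


if __name__ == "__main__":
    for azimuth, elevation, is_main in dense_view_schedule():
        kind = "main" if is_main else "interp"
        print(f"{kind:6s} az={azimuth:6.1f} el={elevation:5.1f}")
```

With the default arguments this yields 4 main views plus 12 interpolated views, 16 in total. In a pipeline like the one described, such poses would condition the interpolation module and then accompany the images into the reconstruction model; how CDI3D actually parameterizes its tilt trajectory is not specified in this summary.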

Paper Structure

This paper contains 19 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Qualitative comparison of multi-view generation between our DVI module and the video diffusion methods SV3D [voleti2024sv3d] and V3D [chen2024v3d]. Two generated images are shown. Images generated by the video diffusion networks show inconsistencies due to the lack of connectivity across frames, whereas our method ensures strong inter-frame connections, which significantly enhances the multi-view consistency of the generated images.
  • Figure 2: (a) The pipeline of our proposed CDI3D. Starting with a single image, CDI3D first generates main views using a multi-view diffusion model. (b) Interpolated views are then obtained from these main views using the DVI module. (c) The images are processed through a ViT to extract feature embeddings, which are then used to generate a high-quality 3D mesh with a tri-plane-based large reconstruction model.
  • Figure 3: Tilt camera pose trajectory design with elevations.
  • Figure 4: Qualitative 3D mesh results generated by CDI3D demonstrate better geometry and texture compared to other baselines. The forklift and rabbit are from the Objaverse dataset, while the others are from the GSO dataset.
  • Figure 5: DVI results of elevated camera trajectories and their corresponding reconstructed meshes. To highlight the differences, we present the results with and without a 30° elevation.
  • ...and 3 more figures