CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction
Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji
TL;DR
CDI3D addresses high-quality 3D reconstruction from a single image by decoupling main-view generation from dense view synthesis. A 2D diffusion model first produces $N=4$ main views; a Dense View Interpolation (DVI) module then synthesizes intermediate views between them along a tilt camera trajectory, and the expanded view set is fed to a tri-plane-based Large Reconstruction Model that generates a textured mesh. The authors report better geometry and texture fidelity, improved multi-view consistency, and faster novel-view synthesis than video-diffusion baselines, with validation on datasets including GSO and Objaverse. The method targets efficient, high-quality 3D content creation for applications such as AR/VR, robotics, and entertainment, and the work points to feature-level view super-resolution as an avenue for further improvement. The key ideas are two-stage view generation in which DVI densifies the main views, elevation-informed camera trajectories, and a cross-modal fusion that efficiently maps multi-view tokens to a fixed tri-plane representation for mesh reconstruction. The number of main views $N$ and the interpolation design are integral to the method's performance, as reflected in the reported gains on geometry and texture metrics.
Abstract
3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain: 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle these challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.
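To make the idea of densifying a few main views along a tilt trajectory concrete, the following is a minimal sketch of one possible camera layout, not the authors' implementation. It assumes $N=4$ main views at evenly spaced azimuths, a hypothetical count of three interpolated views between each adjacent pair, and a hypothetical sinusoidal elevation schedule; the function names, the amplitude, and the camera radius are all illustrative assumptions that do not come from the paper.

```python
import math


def tilt_trajectory(n_main=4, n_interp=3, base_elev_deg=0.0, tilt_amp_deg=20.0):
    """Return (azimuth_deg, elevation_deg, is_main) tuples for one orbit.

    Assumption: main views are evenly spaced in azimuth, with n_interp
    interpolated views inserted between each adjacent pair of main views.
    """
    total = n_main * (n_interp + 1)
    views = []
    for i in range(total):
        azimuth = 360.0 * i / total
        is_main = i % (n_interp + 1) == 0
        # Hypothetical tilt: elevation oscillates once per orbit, so the
        # views cover different elevations rather than a single flat ring.
        elevation = base_elev_deg + tilt_amp_deg * math.sin(math.radians(azimuth))
        views.append((azimuth, elevation, is_main))
    return views


def camera_position(azimuth_deg, elevation_deg, radius=1.5):
    """Convert spherical angles to a camera position on a sphere (z is up)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        radius * math.cos(el) * math.cos(az),
        radius * math.cos(el) * math.sin(az),
        radius * math.sin(el),
    )


if __name__ == "__main__":
    for azimuth, elevation, is_main in tilt_trajectory():
        kind = "main" if is_main else "interp"
        x, y, z = camera_position(azimuth, elevation)
        print(f"{kind:6s} az={azimuth:6.1f} el={elevation:6.1f} pos=({x:.2f}, {y:.2f}, {z:.2f})")
```

With these assumed defaults, the layout has 16 views, 4 of which are main views. In the pipeline described above, the interpolated slots would be filled by the DVI module and the full set passed to the reconstruction model; the actual view counts and elevation schedule used in CDI3D may differ.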
