DiffPoint: Single and Multi-view Point Cloud Reconstruction with ViT Based Diffusion Model

Yu Feng, Xing Shi, Mengli Cheng, Yun Xiong

TL;DR

This work tackles the challenge of reconstructing high-fidelity 3D point clouds from 2D images by combining vision transformers and diffusion models. DiffPoint introduces a ViT-based diffusion framework that tokenizes irregular point patches and fuses CLIP image embeddings through a unified multi-token Transformer backbone to predict the target point cloud conditioned on one or more views. The approach leverages a PointNet-based patch encoder, FPS-KNN point patching, and a self-attention feature fusion module to achieve state-of-the-art performance on ShapeNet for both single-view and multi-view reconstruction, and it generalizes well to OBJAVERSE-LVIS. The results indicate that a unified ViT-diffusion architecture can reconcile the disparities between images and geometry and scale to diverse 3D datasets, offering a versatile tool for accurate 3D reconstruction from single-view or multi-view inputs.

Abstract

As the task of 2D-to-3D reconstruction has gained significant attention in various real-world scenarios, it becomes crucial to be able to generate high-quality point clouds. Despite the recent success of deep learning models in generating point clouds, there are still challenges in producing high-fidelity results due to the disparities between images and point clouds. While vision transformers (ViT) and diffusion models have shown promise in various vision tasks, their benefits for reconstructing point clouds from images have not been demonstrated yet. In this paper, we first propose a neat and powerful architecture called DiffPoint that combines ViT and diffusion models for the task of point cloud reconstruction. At each diffusion step, we divide the noisy point clouds into irregular patches. Then, using a standard ViT backbone that treats all inputs as tokens (including time information, image embeddings, and noisy patches), we train our model to predict target points based on input images. We evaluate DiffPoint on both single-view and multi-view reconstruction tasks and achieve state-of-the-art results. Additionally, we introduce a unified and flexible feature fusion module for aggregating image features from single or multiple input images. Furthermore, our work demonstrates the feasibility of applying unified architectures across languages and images to improve 3D reconstruction tasks.
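The patching step described above (dividing a noisy point cloud into irregular patches, which the TL;DR names as FPS-KNN) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch count, the patch size, and the choice to express each patch relative to its center are all assumptions made for illustration.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, seed=0):
    """Return indices of num_centers points that are spread out over the cloud.

    points: (N, 3) array. The first center is picked at random (an assumption).
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)
    # Squared distance from every point to its nearest already-chosen center.
    min_dist = np.full(n, np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        # The next center is the point farthest from all chosen centers.
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def knn_patches(points, center_idx, k):
    """Group the k nearest neighbours of each center into one patch.

    Returns patches of shape (G, k, 3), expressed relative to their center
    (a hypothetical normalization choice), and the centers of shape (G, 3).
    """
    centers = points[center_idx]                                   # (G, 3)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (G, N)
    nn_idx = np.argsort(d2, axis=1)[:, :k]                          # (G, k)
    patches = points[nn_idx]                                        # (G, k, 3)
    return patches - centers[:, None, :], centers

# Example: a stand-in "noisy" cloud of 256 points split into 16 patches of 8.
noisy_cloud = np.random.default_rng(1).normal(size=(256, 3))
center_idx = farthest_point_sampling(noisy_cloud, num_centers=16)
patches, centers = knn_patches(noisy_cloud, center_idx, k=8)
```

In a pipeline like the one the abstract describes, each `(k, 3)` patch would then be passed to a patch encoder to produce one token for the Transformer; that encoder is outside the scope of this sketch.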

Paper Structure (17 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 2: Illustration of the feature aggregation module. Multi-image features encoded by CLIP are aggregated through an attention mechanism.
  • Figure 3: Single View 3D Reconstruction. The input image is shown in the first column; the other columns show the results of our method compared to various baselines.
  • Figure 4: Multi View 3D Reconstruction. Multi-view object reconstruction using 5 input views (only 1 is displayed). The first column shows the input image, while the remaining columns display the results of our method compared to different baselines.
  • Figure 5: Multi View Reconstruction Results On OBJAVERSE-LVIS. Models are trained using 5 input views. The first column displays one of the input images, while the second column showcases the results for our DiffPoint.
  • Figure 6: Generated samples on OBJAVERSE-LVIS dataset.
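
The feature aggregation that Figure 2 illustrates (per-view CLIP features combined through attention) can be sketched as single-head self-attention over the view embeddings. This is a hedged illustration only: the single head, the random projection matrices `w_q`, `w_k`, `w_v`, and the final mean pooling into one vector are assumptions, not details confirmed by the source, and the random inputs stand in for real CLIP embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_feats, w_q, w_k, w_v):
    """Aggregate (V, D) per-view features into one (D,) vector.

    Works for any number of views V >= 1, so the same module handles the
    single-view and multi-view cases.
    """
    q, k, v = view_feats @ w_q, view_feats @ w_k, view_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (V, V) scaled dot products
    attn = softmax(scores, axis=-1)           # each row sums to 1
    attended = attn @ v                       # (V, D) view-aware features
    # Mean pooling over views is an assumed way to obtain a single vector.
    return attended.mean(axis=0), attn

# Example: 5 views with 16-dimensional placeholder embeddings.
rng = np.random.default_rng(0)
dim = 16
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
five_views = rng.normal(size=(5, dim))
fused, attn = fuse_views(five_views, w_q, w_k, w_v)
```

With a single input view the attention matrix reduces to `[[1.0]]`, so the fused vector is simply that view's projected feature, which is why one module can serve both settings.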