RGB2Point: 3D Point Cloud Generation from Single RGB Images

Jae Joong Lee, Bedrich Benes

TL;DR

RGB2Point reconstructs dense 3D point clouds from a single RGB image, replacing diffusion-based generation with a Transformer-based pipeline. It combines a pre-trained Vision Transformer feature extractor with a Contextual Feature Integrator and a Geometric Projection Module to produce high-fidelity point clouds with a configurable number of points, using only about 2.3 GB of VRAM and taking roughly 0.15 s per image at inference. On ShapeNet and Pix3D, it outperforms state-of-the-art diffusion and CNN-based baselines in Chamfer distance, Earth Mover’s distance, and F-score, and it is markedly more consistent across object categories. The approach is practical for fast, on-device 3D reconstruction in robotics and AR/VR, and could be extended to multi-view setups and differentiable rendering.

Abstract

We introduce RGB2Point, a Transformer-based method for generating a 3D point cloud from an unposed single-view RGB image. RGB2Point takes an input image of an object and generates a dense 3D point cloud. In contrast to prior work based on CNN layers and diffusion denoising, we use pre-trained Transformer layers, which are fast and generate high-quality point clouds with consistent quality across the available categories. On a real-world dataset, our generated point clouds improve Chamfer distance by 51.15% and Earth Mover's distance by 45.96% over the current state-of-the-art. On a synthetic dataset, our approach improves Chamfer distance by 39.26%, Earth Mover's distance by 26.95%, and F-score by 47.16%. Moreover, our method produces 63.1% more consistent high-quality results across object categories than prior works. RGB2Point is also computationally efficient: it requires only 2.3GB of VRAM to reconstruct a 3D point cloud from a single RGB image, and our implementation generates results 15,133x faster than a SOTA diffusion-based model.
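
The metrics named in the abstract can be made concrete with a small sketch. The snippet below computes a symmetric Chamfer distance and a threshold-based F-score between two point sets with NumPy. The function names, the squared-distance convention, and the threshold value are illustrative assumptions; the paper's exact evaluation protocol (normalization, squared versus unsquared distances, threshold) is not stated in this excerpt and may differ.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between all points of a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]  # (N, M, 3)
    return np.sum(diff * diff, axis=-1)   # (N, M)


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance (assumed form: mean squared nearest-neighbor
    distance, summed over both directions)."""
    d = pairwise_sq_dists(pred, gt)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def f_score(pred, gt, threshold=0.01):
    """F-score at a distance threshold (the threshold value is an assumption)."""
    d = np.sqrt(pairwise_sq_dists(pred, gt))
    precision = np.mean(d.min(axis=1) < threshold)  # predicted points near GT
    recall = np.mean(d.min(axis=0) < threshold)     # GT points near prediction
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


# Synthetic stand-in data: a random "ground truth" cloud and a noisy copy of it.
rng = np.random.default_rng(0)
gt = rng.uniform(-0.5, 0.5, size=(1024, 3))
pred = gt + rng.normal(scale=0.005, size=gt.shape)

print("Chamfer(gt, gt):  ", chamfer_distance(gt, gt))
print("Chamfer(pred, gt):", chamfer_distance(pred, gt))
print("F-score(pred, gt):", f_score(pred, gt))
```

Lower Chamfer distance and higher F-score indicate a closer match. Earth Mover's distance is omitted here because it requires solving an assignment problem between the two point sets.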

Paper Structure (11 sections, 1 equation, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Model architecture. RGB2Point takes a single-view RGB image and extracts image features with a pre-trained ViT [vit]. The Contextual Feature Integrator then refines the extracted features by applying a multi-head attention mechanism [vaswani2017attention] that highlights specific regions of interest within them. The weighted features are forwarded to the Geometric Projection Module, which maps them into 3D space, producing a point cloud representation. RGB2Point is designed to require only 2.3GB of VRAM to generate a 3D point cloud from a single RGB image.
  • Figure 2: A qualitative analysis compares 3D point clouds generated by our method, RGB2Point, from single RGB images across airplane, car, and chair categories in ShapeNet against their target point clouds.
  • Figure 3: Point clouds generated by RGB2Point from images in the real-world dataset Pix3D [pix3d]. The first column shows the input RGB image, and the next two columns show meshes reconstructed by LRM [hong2023lrm] and TripoSR [tochilkin2024triposr]. The fourth and fifth columns show point clouds reconstructed by Point-E [nichol2022point] and LION [vahdat2022lion]. The sixth column from the left shows the point cloud generated by RGB2Point, and the column labeled GT shows the ground-truth point cloud. Red arrows highlight differences from the GT. The last column shows a rotated view of our output.
  • Figure 4: Outputs with different numbers of points. Our original pipeline generates 1,024 points; here we show 128 points. The overall shape is preserved rather than losing a random region of the point cloud.
  • Figure 5: Three failure cases from a complex real-world dataset, Pix3D [pix3d], with their input images, single-image 3D reconstructions as meshes [hong2023lrm, tochilkin2024triposr] and as 3D point clouds [nichol2022point, vahdat2022lion], our outputs, and the ground-truth 3D point clouds.
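
The data flow described in the Figure 1 caption (ViT features, then multi-head attention, then a projection into 3D) can be sketched at the level of tensor shapes. The NumPy code below is a minimal illustration with random, untrained weights and is not the authors' implementation. The token count (197), feature width (768), head count (8), mean pooling over tokens, the single linear projection, and all function names are assumptions made for this sketch; random tokens stand in for the output of a pre-trained ViT.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, p, num_heads):
    """Scaled dot-product self-attention over tokens of shape (T, D)."""
    t, d = tokens.shape
    head_dim = d // num_heads

    def split(x):  # (T, D) -> (H, T, head_dim)
        return x.reshape(t, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ p["w_q"]), split(tokens @ p["w_k"]), split(tokens @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, T, T)
    heads = softmax(scores, axis=-1) @ v                   # (H, T, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(t, d)        # (T, D)
    return merged @ p["w_o"]


def init_params(d, num_points, rng):
    """Random placeholder weights (hypothetical; a real model would be trained)."""
    p = {name: rng.normal(scale=0.02, size=(d, d)) for name in ("w_q", "w_k", "w_v", "w_o")}
    p["w_proj"] = rng.normal(scale=0.02, size=(d, num_points * 3))
    p["b_proj"] = np.zeros(num_points * 3)
    return p


def forward(vit_tokens, p, num_heads, num_points):
    """Attention-refined features -> pooled vector -> (num_points, 3) coordinates."""
    refined = multi_head_self_attention(vit_tokens, p, num_heads)  # feature integration
    pooled = refined.mean(axis=0)                                  # assumed pooling step
    flat = pooled @ p["w_proj"] + p["b_proj"]                      # projection to 3D
    return flat.reshape(num_points, 3)


rng = np.random.default_rng(0)
vit_tokens = rng.normal(size=(197, 768))  # stand-in for pre-trained ViT features
params = init_params(d=768, num_points=1024, rng=rng)
points = forward(vit_tokens, params, num_heads=8, num_points=1024)
print(points.shape)
```

Changing `num_points` (for example to 128, as in Figure 4) only changes the width of the final projection in this sketch, which is one plausible way to make the output size configurable.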