ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction

Apoorv Thapliyal, Vinay Lanka, Swathi Baskaran

TL;DR

ObitoNet addresses high-resolution point cloud reconstruction from sparse 3D data by fusing rich image semantics with precise geometry in a Cross-Attention-based multimodal pipeline. The method combines a Vision Transformer (ViT or ViTMAE) image encoder with an FPS+KNN point cloud tokenizer, merges their features via Cross-Attention, and decodes the fused features with a transformer to generate detailed point clouds. Key contributions include the Cross-Attention fusion framework, a data pairing approach that yields aligned image–point-cloud pairs from Tanks and Temples, and a staged training scheme for multimodal learning. The approach shows competitive reconstruction quality across its Base, ViTMAE, and Large variants, and holds promise for downstream tasks in AR, robotics, and spatial reasoning where 2D–3D fusion is beneficial.
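The Cross-Attention fusion step can be illustrated with a minimal NumPy sketch of single-head scaled dot-product attention. This is not the authors' implementation: it assumes that point cloud tokens act as queries and image tokens as keys and values (the summary does not state the direction), and the names `cross_attention`, `w_q`, `w_k`, `w_v`, along with all token counts and dimensions, are hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(point_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (assumed direction: points query the image)."""
    q = point_tokens @ w_q                     # (P, d) queries from geometry
    k = image_tokens @ w_k                     # (I, d) keys from image patches
    v = image_tokens @ w_v                     # (I, d) values from image patches
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (P, I) scaled similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1 over image tokens
    fused = weights @ v                        # (P, d) image-informed point features
    return fused, weights

# Hypothetical sizes: 64 point tokens, 196 image patch tokens.
rng = np.random.default_rng(0)
d_point, d_image, d_model = 48, 96, 32
point_tokens = rng.normal(size=(64, d_point))
image_tokens = rng.normal(size=(196, d_image))
w_q = rng.normal(size=(d_point, d_model)) / np.sqrt(d_point)
w_k = rng.normal(size=(d_image, d_model)) / np.sqrt(d_image)
w_v = rng.normal(size=(d_image, d_model)) / np.sqrt(d_image)

fused, weights = cross_attention(point_tokens, image_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Under these assumptions, each row of `weights` is a distribution over image patches for one point token, which is the kind of quantity a cross-attention weight visualization would display.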

Abstract

ObitoNet employs a Cross-Attention mechanism to integrate multimodal inputs: Vision Transformers (ViT) extract semantic features from images, while a point cloud tokenizer processes geometric information using Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) to capture spatial structure. The learned multimodal features are fed into a transformer-based decoder for high-resolution point cloud reconstruction. This approach leverages the complementary strengths of both modalities, rich image features and precise geometric details, ensuring robust point cloud generation even in challenging conditions such as sparse or noisy data.
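To make the FPS and KNN tokenization step concrete, the following is a rough NumPy sketch of how a point cloud might be split into local patches: FPS picks well-spread center points, and KNN gathers the nearest neighbors around each center. It is an illustration under stated assumptions, not the paper's code: the function names, the group count (64), the neighborhood size (32), and the choice to express each patch in center-relative coordinates are all hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, start_index=0):
    """Greedily pick indices of `num_centers` points that are far from each other."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    # Squared distance from every point to its nearest already-chosen center.
    min_sq_dist = np.full(n, np.inf)
    current = start_index
    for i in range(num_centers):
        chosen[i] = current
        sq_dist = np.sum((points - points[current]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
        # Next center: the point farthest from all centers chosen so far.
        current = int(np.argmax(min_sq_dist))
    return chosen

def group_knn(points, center_idx, k):
    """Gather the k nearest points around each center (assumed center-relative)."""
    centers = points[center_idx]                       # (G, 3)
    diff = centers[:, None, :] - points[None, :, :]    # (G, N, 3)
    sq_dist = np.sum(diff ** 2, axis=2)                # (G, N)
    knn_idx = np.argsort(sq_dist, axis=1)[:, :k]       # (G, k) nearest first
    patches = points[knn_idx] - centers[:, None, :]    # (G, k, 3)
    return centers, patches

# Hypothetical input: a random cloud of 1024 points.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
center_idx = farthest_point_sampling(cloud, num_centers=64)
centers, patches = group_knn(cloud, center_idx, k=32)
print(centers.shape, patches.shape)
```

Each of the resulting patches could then be embedded into a token by a small network; that embedding step is omitted here because the summary does not describe it.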

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: PointMAE point cloud reconstruction sample outputs
  • Figure 2: Overview of the proposed pipeline. ObitoNet integrates image features with point cloud tokens for robust reconstruction.
  • Figure 3: Side-by-side comparison of (a) Point Cloud Visualizations and (b) Cross Attention Weights for the ObitoNet/Base Model
  • Figure 4: Side-by-side comparison of (a) Point Cloud Visualizations and (b) Cross Attention Weights for the ObitoNet/ViTMAE Model
  • Figure 5: Side-by-side comparison of (a) Point Cloud Visualizations and (b) Cross Attention Weights for the ObitoNet/Large Model
  • ...and 3 more figures