A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion

Fangzhou Lin, Zilin Dai, Rigved Sanku, Songlin Hou, Kazunori D Yamada, Haichong K. Zhang, Ziming Zhang

TL;DR

This work challenges the need for single-view image guidance in point cloud completion by proposing a view-free baseline built on a multi-branch encoder with hierarchical self-fusion and attention-based feature fusion. The architecture processes only partial point clouds, using cross- and self-attention to integrate multi-branch representations before decoding to a complete cloud. Extensive ShapeNet-ViPC experiments and ablations demonstrate performance competitive with or superior to state-of-the-art view-guided methods, highlighting the potential of view-free approaches. The study also analyzes architectural choices, loss functions, and complexity, offering insights into when and how multiple branches and fusion strategies yield the best trade-offs.

Abstract

The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that is view-free: it takes only partial point clouds as input. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.
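
To make the fusion idea concrete, the following is a minimal sketch of one cross-attention step followed by one self-attention step between two feature streams. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attention`, `fuse`), the residual connections, the single attention head, and the absence of learned projection weights are all simplifications introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped feature lists (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(branch_a, branch_b):
    """Hypothetical fusion step: branch_a attends to branch_b, then to itself."""
    crossed = add(branch_a, attention(branch_a, branch_b, branch_b))  # cross-attention
    return add(crossed, attention(crossed, crossed, crossed))         # self-attention


# Toy per-point features from two encoder branches (values are made up).
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_b = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(feats_a, feats_b)
```

The fused output keeps the shape of the querying branch (`feats_a` here), so in this sketch features from several levels or branches could be fused pairwise and then concatenated before decoding.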

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Architecture of our attention-enhanced self-fusion network with two branches. (We used three branches for performance comparison in experiments; the two-branch case shown here extends easily to three.) The incomplete point cloud is processed by two 3D encoders, each extracting hierarchical features across three levels using Set Abstraction Layers (SAL) and Point Transformers (PT). Intermediate features from all levels are fed into the self-fusion network, where cross-attention and self-attention refine the representations. The fused features (six in total: three per encoder, produced by the two self-fusion networks) are concatenated and passed through a decoder. The decoder, along with the original point cloud, generates a refined point cloud. Farthest Point Sampling (FPS) ensures uniform point distribution, while reconstruction loss guides the learning process.
  • Figure 2: Row-1: Input images. Row-2: Incomplete point clouds. Row-3: View-guided (XMFNet) outputs. Row-4: Our view-free outputs. Row-5: Ground truth.
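
The Figure 1 caption mentions Farthest Point Sampling (FPS) as the step that keeps the output points evenly spread. As a rough illustration, the sketch below implements the standard greedy FPS procedure in plain Python. The function name, the `start` parameter, and the toy points are assumptions made for this example; the paper's own implementation is not shown in the source.

```python
def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the already selected set.

    points: list of equal-length coordinate tuples
    k:      number of points to keep (1 <= k <= len(points))
    start:  index of the first selected point (an arbitrary choice in this sketch)
    Returns the indices of the selected points, in selection order.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    selected = [start]
    # Squared distance from every point to its nearest selected point so far.
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return selected


# Toy cloud: two near-duplicate points, two axis points, and one outlier.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
kept = farthest_point_sampling(cloud, 3)
```

In this toy case the near-duplicate point at index 1 is never chosen, because it always sits close to an already selected point. That is the behavior that gives FPS its roughly uniform coverage.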