Position-aware Guided Point Cloud Completion with CLIP Model
Feng Zhou, Qi Zhang, Ju Dai, Lei Li, Qing Fan, Junliang Xing
TL;DR
This work tackles incomplete point clouds caused by occlusion and limited viewpoints by introducing a CLIP-driven multimodal framework that fuses point data, text, and six projection images. The method builds a Point-Text-Image corpus (PCN-TI and MVP-TI) and uses a CLIP-enhanced module to extract global features $F_g$, local features $F_l$, and text features that guide completion. A position-aware module splits each projection map into 2×2 blocks and learns per-block cues to localize missing regions, using cross-attention to fuse local and global cues into a decoder-ready representation $F_D$ that is combined with $F_p$ and the CLIP features. Extensive ablation and benchmark results show state-of-the-art performance on PCN and competitive gains on KITTI, demonstrating the practical value of locating missing areas with multimodal guidance while avoiding heavy reliance on calibration or large language models.
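To make the block splitting and cross-attention fusion concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single-channel projection map whose height and width are divisible by 2, a single attention head, and untrained random weights. The names `split_into_blocks`, `cross_attention`, `W_local`, and `d_model` are hypothetical and introduced only for illustration.

```python
import numpy as np

def split_into_blocks(image, grid=2):
    """Split an (H, W) map into grid*grid equal blocks in row-major order."""
    h, w = image.shape
    assert h % grid == 0 and w % grid == 0, "H and W must be divisible by grid"
    bh, bw = h // grid, w // grid
    # (grid, bh, grid, bw) -> (grid, grid, bh, bw) -> (grid*grid, bh, bw)
    return image.reshape(grid, bh, grid, bw).swapaxes(1, 2).reshape(grid * grid, bh, bw)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention.

    query: (Nq, d); keys and values: (Nk, d). Returns the (Nq, d) output
    and the (Nq, Nk) attention weights.
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
depth_map = rng.random((8, 8))           # stand-in for one projection image
blocks = split_into_blocks(depth_map)    # four 4x4 blocks

d_model = 16                             # assumed feature width
# Hypothetical, untrained linear embedding of each flattened block.
W_local = rng.standard_normal((blocks[0].size, d_model)) * 0.1
local_tokens = blocks.reshape(len(blocks), -1) @ W_local   # (4, d_model)
global_feat = rng.standard_normal((1, d_model))            # stand-in for a global feature

# The global feature queries the four block tokens; `attn` indicates how much
# each block contributes to the fused representation.
fused, attn = cross_attention(global_feat, local_tokens, local_tokens)
print(fused.shape, attn.shape)
```

In this sketch the global feature acts as the query and the block tokens act as keys and values. The source does not specify which side plays which role, so the opposite assignment is equally plausible.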
Abstract
Point cloud completion aims to recover complete geometric and topological shapes from partial point clouds caused by equipment defects or limited viewpoints. Current methods either rely solely on the 3D coordinates of the point cloud or incorporate additional images with well-calibrated intrinsic parameters to guide the geometric estimation of the missing parts. Although these methods achieve excellent performance by directly predicting the locations of the complete points, the extracted features lack fine-grained information about the location of the missing area. To address this issue, we propose a rapid and efficient method to expand a unimodal framework into a multimodal one. This approach incorporates a position-aware module designed to enhance the spatial information of the missing parts through a weighted map learning mechanism. In addition, we establish a Point-Text-Image triplet corpus, PCN-TI and MVP-TI, based on existing unimodal point cloud completion datasets, and use the pre-trained vision-language model CLIP to provide richer detail information for 3D shapes, thereby enhancing performance. Extensive quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art point cloud completion methods.
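As a rough illustration of what a per-block weighted map might look like, the sketch below assigns each block of a projection map a weight and broadcasts it back to pixel resolution. It is an assumption-laden stand-in, not the learned mechanism of the paper: the per-block score here is simply the fraction of empty pixels, where a trained network would produce the scores. The function name `block_weight_map` and the convention that 0 marks an empty pixel are hypothetical.

```python
import numpy as np

def block_weight_map(proj, grid=2):
    """Return a pixel-level weight map and the per-block weights.

    proj: (H, W) projection map where 0 marks an empty pixel (assumed convention).
    The score of each block is its fraction of empty pixels, used here as a
    hand-crafted proxy for a learned score.
    """
    h, w = proj.shape
    assert h % grid == 0 and w % grid == 0, "H and W must be divisible by grid"
    bh, bw = h // grid, w // grid
    # Fraction of empty pixels in each block, shape (grid, grid).
    empty = (proj == 0).reshape(grid, bh, grid, bw).mean(axis=(1, 3))
    # Softmax over all blocks so the weights are positive and sum to 1.
    e = np.exp(empty - empty.max())
    weights = e / e.sum()
    # Repeat each block weight over its bh x bw pixels.
    weight_map = np.kron(weights, np.ones((bh, bw)))
    return weight_map, weights

# Toy projection: fully observed except for the bottom-right quadrant.
proj = np.ones((8, 8))
proj[4:, 4:] = 0
weight_map, weights = block_weight_map(proj)
print(weights)
```

With this toy input the bottom-right block receives the largest weight, which matches the intuition that the weighted map should emphasize regions where points are missing.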
