Position-aware Guided Point Cloud Completion with CLIP Model

Feng Zhou, Qi Zhang, Ju Dai, Lei Li, Qing Fan, Junliang Xing

TL;DR

This work tackles incomplete point clouds caused by occlusion and view limitations by introducing a CLIP-driven multimodal framework that fuses point data, text, and six projection images. A CLIP-enhanced module constructs a Point-Text-Image corpus (PCN-TI and MVP-TI) and extracts global $F_g$ and local $F_l$ features along with text features to guide completion. A position-aware module splits projection maps into blocks (2x2) and learns per-block cues to localize missing regions, using cross-attention to fuse local and global cues into a decoder-ready representation $F_D$ that is combined with $F_p$ and CLIP features. The proposed PCN-TI and MVP-TI datasets, together with extensive ablation and benchmark results, show state-of-the-art performance on PCN and competitive gains on KITTI, demonstrating the practical value of locating missing areas with multimodal guidance and avoiding heavy reliance on calibration or large language models.
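The block-splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_into_blocks` and `block_emptiness_weights` are hypothetical, and the occupancy-based softmax weights are a hand-crafted stand-in for the per-block cues that the paper learns. It assumes a single-channel projection map in which zero pixels mean "no point projected here".

```python
import numpy as np


def split_into_blocks(image, rows=2, cols=2):
    """Split an (H, W) projection map into rows*cols equal blocks (row-major)."""
    h, w = image.shape
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the grid size")
    bh, bw = h // rows, w // cols
    # (H, W) -> (rows, bh, cols, bw) -> (rows, cols, bh, bw) -> (rows*cols, bh, bw)
    blocks = image.reshape(rows, bh, cols, bw).swapaxes(1, 2)
    return blocks.reshape(rows * cols, bh, bw)


def block_emptiness_weights(image, rows=2, cols=2):
    """Assumed heuristic: weight each block by how empty it is.

    The paper learns these weights; here a softmax over the fraction of
    empty pixels per block is used purely for illustration.
    """
    blocks = split_into_blocks(image, rows, cols)
    occupancy = (blocks > 0).mean(axis=(1, 2))  # fraction of filled pixels
    emptiness = 1.0 - occupancy
    exp = np.exp(emptiness - emptiness.max())   # numerically stable softmax
    return exp / exp.sum()


if __name__ == "__main__":
    # Toy 4x4 projection map whose top-left 2x2 block has no projected points.
    proj = np.ones((4, 4))
    proj[:2, :2] = 0.0
    weights = block_emptiness_weights(proj)
    print("block with the largest weight:", int(weights.argmax()))
```

In this toy case the empty top-left block receives the largest weight, which mirrors the intent of the position-aware module: to direct attention toward the region of the projection where geometry is missing.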

Abstract

Point cloud completion aims to recover the complete geometry and topology of shapes that are only partially captured because of equipment defects or limited viewpoints. Current methods either rely solely on the 3D coordinates of the point cloud or incorporate additional images with well-calibrated intrinsic parameters to guide the geometric estimation of the missing parts. Although these methods achieve excellent performance by directly predicting the locations of the complete points, the extracted features lack fine-grained information about where the missing area is. To address this issue, we propose a rapid and efficient method to expand a unimodal framework into a multimodal framework. This approach incorporates a position-aware module designed to enhance the spatial information of the missing parts through a weighted map learning mechanism. In addition, we establish two Point-Text-Image triplet corpora, PCN-TI and MVP-TI, based on existing unimodal point cloud completion datasets, and use the pre-trained vision-language model CLIP to provide richer detail information for 3D shapes, thereby enhancing performance. Extensive quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art point cloud completion methods.

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An exemplary instance from our PCN-TI dataset. The PCN-TI dataset comprises many triplets (Point-Text-Image) consisting of projection images, readily available textual descriptions, and incomplete point clouds.
  • Figure 2: The overall architecture of our method consists of two main parts: the CLIP-enhanced module and the Position-aware module. $F_g$, $F_l$, $F_T$, and $F_t$ in the CLIP-enhanced module denote the global-scale feature, the local-scale feature, the text feature from CLIP, and the processed text feature of $F_T$, respectively. $F_c$, $F_c^{'}$, $F_p$, and $F_D$ denote the CLIP feature, processed CLIP feature, point cloud feature, and fusion feature fed into the decoder, respectively.
  • Figure 3: Motivation for the position-aware module: (a) a projection image obtained from an incomplete car point cloud; (b) the relevancy between the text "This is a map of car" and the projection map; (c) the relevancy between the text "There is a map of car missing a piece from the left side" and the projection map; (d) a projection map derived from an incomplete airplane point cloud. To illustrate the potential limitations of text in capturing missing parts, we provide (e) and (f) for fair comparison.
  • Figure 4: Point cloud completion results on the PCN dataset. From left to right: partial input, results of GRNet, PoinTr, SnowflakeNet, AdaPoinTr, ours, and ground truth. Best viewed in color and zoomed in.
  • Figure 5: Qualitative results on KITTI. The comparison shows that our method produces more plausible results than other methods.