Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

Hang Xu, Chen Long, Wenxiao Zhang, Yuan Liu, Zhen Cao, Zhen Dong, Bisheng Yang

TL;DR

EGIInet tackles cross-modal point cloud completion by explicitly guiding information interaction between a partial point cloud and a single-view image. It introduces a unified encoder to align the two modalities and a separable interaction pathway (SFTnet) supervised by a Gram-matrix-based FT-Loss, followed by a simple cross-attention fusion and an XMFnet-like decoder that produces the completed shape. The FT-Loss comprises an Informational Loss and a Structural Loss, with Gram matrices $G(\mathbf{F}) = \mathbf{F}^T \bullet \mathbf{F}$ enabling explicit transfer of structural cues. This yields state-of-the-art results on ShapeNet-ViPC with fewer parameters (9.03M vs 9.57M) and a $16\%$ reduction in $l_2$-CD over the previous best. The work demonstrates that explicit guidance in multi-modal fusion improves the reliability and accuracy of view-guided completion and suggests broader potential for explicit interaction in multi-modal tasks.
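As a rough illustration of the Gram-matrix idea mentioned above, the following pure-Python sketch computes $G(\mathbf{F}) = \mathbf{F}^T \bullet \mathbf{F}$ for a token-feature matrix and compares two Gram matrices with a mean squared error. The function names (`gram`, `gram_mse`), the (tokens, channels) layout of $\mathbf{F}$, and the MSE comparison are assumptions made for illustration only; the exact form of the Structural Loss is not stated in this summary, and a real implementation would use a tensor library.

```python
def gram(F):
    """Gram matrix G(F) = F^T F for a feature matrix F given as N rows of C channels."""
    n_channels = len(F[0])
    return [
        [sum(row[i] * row[j] for row in F) for j in range(n_channels)]
        for i in range(n_channels)
    ]


def gram_mse(F_a, F_b):
    """Mean squared difference between two Gram matrices (assumed loss form)."""
    G_a, G_b = gram(F_a), gram(F_b)
    c = len(G_a)
    return sum(
        (G_a[i][j] - G_b[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy features: 2 channels per token (values are made up for the example).
image_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
point_tokens = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]  # same rows, different order
other_tokens = [[1.0, 0.0], [0.0, 1.0]]              # different token count

print(gram(image_tokens))                    # [[2.0, 1.0], [1.0, 5.0]]
print(gram_mse(image_tokens, point_tokens))  # 0.0
print(gram_mse(image_tokens, other_tokens))  # 4.75
```

Because the product sums over tokens, the result is a channels-by-channels matrix regardless of how many tokens each modality produces, and it is unchanged when tokens are reordered. That property is one plausible reason a Gram matrix suits comparing image and point cloud features whose tokens do not correspond one to one.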

Abstract

In this paper, we explore a novel framework, EGIInet (Explicitly Guided Information Interaction Network), a model for the View-guided Point cloud Completion (ViPC) task, which aims to restore a complete point cloud from a partial one with the help of a single-view image. In comparison with previous methods that rely on the global semantics of input images, EGIInet efficiently combines the information from the two modalities by leveraging the geometric nature of the completion task. Specifically, we propose an explicitly guided information interaction strategy supported by modal alignment for point cloud completion. First, in contrast to previous methods, which simply encode features with separate 2D and 3D backbones, we unify the encoding process to promote modal alignment. Second, we propose a novel explicitly guided information interaction strategy that helps the network identify critical information within images, thus achieving better guidance for completion. Extensive experiments demonstrate the effectiveness of our framework, and we achieve a new state-of-the-art (16% lower CD than XMFnet) on benchmark datasets despite using fewer parameters than previous methods. The pre-trained model and code are available at https://github.com/WHU-USI3DV/EGIInet.
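For readers unfamiliar with the CD metric quoted above, here is a minimal pure-Python sketch of a squared-distance ($l_2$) Chamfer Distance between two point sets. It follows one common definition (mean nearest-neighbor squared distance in both directions, summed); the normalization and scaling used in the reported benchmark numbers may differ, so treat the function `chamfer_l2` and the toy points as illustrative assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_l2(P, Q):
    """Symmetric Chamfer Distance with squared distances (one common definition)."""
    # Average distance from each point in P to its nearest neighbor in Q.
    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    # Average distance from each point in Q to its nearest neighbor in P.
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p


# Toy point clouds (made-up coordinates, not from any dataset).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

print(chamfer_l2(predicted, ground_truth))  # 1.0
print(chamfer_l2(predicted, predicted))     # 0.0
```

Lower values mean the completed shape lies closer to the ground truth, which is why an improvement is reported as a reduction in CD. This brute-force version is quadratic in the number of points and is only meant to show the definition.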

Paper Structure (28 sections, 11 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Cross-attention weight map projection of our method (b) and XMFnet [aiello2022cross] (c). Compared with XMFnet [aiello2022cross], our method extracts clearer structural information from the images.
  • Figure 2: Architecture of EGIInet. Modal alignment is performed by the Unified Encoder. In Information Fusion, the FT-Loss ($\mathcal{L}_{transfer}$) explicitly guides the indirect interaction between image information and point cloud information.
  • Figure 3: Tokenization process for images (a) and point clouds (b).
  • Figure 4: Qualitative results on the ShapeNet-ViPC dataset [zhang2021view].
  • Figure 5: Image feature projection of our method with (b) and without (c) FT-Loss. Supervising with the FT-Loss explicitly guides information interaction, so that the image features capture the information most helpful for point cloud completion.
  • ...and 2 more figures