CrossVideo: Self-supervised Cross-modal Contrastive Learning for Point Cloud Video Understanding

Yunze Liu, Changxi Chen, Zifan Wang, Li Yi

TL;DR

CrossVideo tackles 4D point cloud video understanding from unlabeled data by leveraging cross-modal supervision from image videos. It introduces intra-modal and cross-modal contrastive objectives that operate at both the video level and the frame level to learn robust spatiotemporal representations, with a two-branch backbone for point cloud and image videos. The method reports state-of-the-art results on HOI4D action segmentation and semantic segmentation, and ablations support the contribution of each loss component and the benefit of cross-modal pretraining. The work also shows that the image video encoder both benefits from and contributes to the learned representations, which is relevant in practice for multimodal 4D understanding.

Abstract

This paper introduces a novel approach named CrossVideo, which aims to enhance self-supervised cross-modal contrastive learning in the field of point cloud video understanding. Traditional supervised learning methods encounter limitations due to data scarcity and challenges in label acquisition. To address these issues, we propose a self-supervised learning method that leverages the cross-modal relationship between point cloud videos and image videos to acquire meaningful feature representations. Intra-modal and cross-modal contrastive learning techniques are employed to facilitate effective comprehension of point cloud video. We also propose a multi-level contrastive approach for both modalities. Through extensive experiments, we demonstrate that our method significantly surpasses previous state-of-the-art approaches, and we conduct comprehensive ablation studies to validate the effectiveness of our proposed designs.

Paper Structure (25 sections, 8 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Our core idea is to learn 4D point cloud video representations by leveraging the synergy with image video in a self-supervised manner. We propose an intra-modal contrastive objective and a cross-modal contrastive objective to learn the spatio-temporal invariance of point cloud videos and the correlation between point cloud videos and image videos.
  • Figure 2: Cross-modality pretraining enables distillation of knowledge from 2D videos into point cloud videos, while intra-modality pretraining allows the encoder to possess invariance to a set of geometric transformations. The feature space at the video level and the frame level can both provide constraints on the encoder at different levels of granularity.
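
The intra-modal and cross-modal objectives at the video level and the frame level, as described in the abstract and in Figure 2, can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive loss, a temperature of 0.07, equal weights across the four terms, and precomputed embeddings. None of these details are stated in this summary, and the function names (info_nce, symmetric_info_nce, multi_level_loss) are hypothetical rather than the authors' implementation.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: row i of `anchors` is pulled toward row i of `positives`
    and pushed away from every other row in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))        # matched pairs sit on the diagonal


def symmetric_info_nce(x, y, temperature=0.07):
    """Average of both matching directions (x -> y and y -> x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def multi_level_loss(pc_video_a, pc_video_b, img_video,
                     pc_frame_a, pc_frame_b, img_frame,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum of four contrastive terms (equal weights are an assumption).

    pc_*_a / pc_*_b: embeddings of two augmented views of the same point cloud
    video, shape (B, D) at video level and (B*T, D) at frame level.
    img_*: embeddings of the paired image video at the same granularity.
    """
    terms = (
        symmetric_info_nce(pc_video_a, pc_video_b),  # intra-modal, video level
        symmetric_info_nce(pc_video_a, img_video),   # cross-modal, video level
        symmetric_info_nce(pc_frame_a, pc_frame_b),  # intra-modal, frame level
        symmetric_info_nce(pc_frame_a, img_frame),   # cross-modal, frame level
    )
    return float(sum(w * t for w, t in zip(weights, terms)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, T, D = 4, 3, 16                               # batch, frames, feature dim
    video = rng.normal(size=(B, D))
    frames = rng.normal(size=(B * T, D))
    noise = lambda x: x + 0.05 * rng.normal(size=x.shape)  # stand-in for augmentation
    total = multi_level_loss(video, noise(video), noise(video),
                             frames, noise(frames), noise(frames))
    print("total loss:", total)
```

The intra-modal terms compare two views of the same point cloud video, which is one way to encourage the invariance to geometric transformations mentioned in Figure 2. The cross-modal terms align each point cloud embedding with its paired image embedding. The sketch covers only the point cloud side of the intra-modal objective.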