
VG4D: Vision-Language Model Goes 4D Video Recognition

Zhichao Deng, Xiangtai Li, Xia Li, Yunhai Tong, Shen Zhao, Mengyuan Liu

TL;DR

VG4D addresses 4D point cloud action recognition by transferring Vision-Language Model priors to a 4D encoder. It introduces im-PSTNet as a modernized 4D backbone and trains via cross-modal contrastive learning that aligns 4D, RGB video, and text representations in a shared space. The method optimizes $L_{cl} = \alpha L_{pc,video} + \beta L_{pc,text}$ and $L_{final} = L_{cl} + \theta L_{pc} + \gamma L_{rgb}$, with $L_{pc,text} = \frac{1}{N} \sum_i -\log \frac{\exp(f_i^T f_i^P)}{\sum_j \exp(f_j^T f_i^P)}$ and $L_{pc,video} = \frac{1}{N} \sum_i -\log \frac{\exp(f_i^V f_i^P)}{\sum_j \exp(f_j^V f_i^P)}$. The approach uses language-RGB-4D triplets and ensembles four modality scores at inference, achieving state-of-the-art results on NTU RGB+D 60/120 and demonstrating that VLM priors can compensate for limited texture in point clouds.
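The losses in the TL;DR can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the superscripts $T$, $V$, and $P$ label the text, RGB video, and point cloud features (so $f_i^T f_i^P$ is a dot product, not a transpose), that features are plain lists of floats, and that $L_{pc}$ and $L_{rgb}$ are computed elsewhere and passed in as scalars. The function names and the default weights of 1.0 are hypothetical placeholders.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchors, points):
    """(1/N) * sum_i -log( exp(a_i . p_i) / sum_j exp(a_j . p_i) ).

    anchors: N text or video features (f^T or f^V); points: N point cloud
    features (f^P). Sample i in both lists is assumed to be a matched pair.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        logits = [dot(anchors[j], points[i]) for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def final_loss(f_text, f_video, f_pc, l_pc, l_rgb,
               alpha=1.0, beta=1.0, theta=1.0, gamma=1.0):
    """L_final = L_cl + theta * L_pc + gamma * L_rgb.

    The weight defaults are placeholders, not values from the paper.
    """
    l_pc_video = contrastive_loss(f_video, f_pc)
    l_pc_text = contrastive_loss(f_text, f_pc)
    l_cl = alpha * l_pc_video + beta * l_pc_text
    return l_cl + theta * l_pc + gamma * l_rgb
```

Each contrastive term falls when a point cloud feature scores higher against its own text or video feature than against the other samples in the batch, which is how the 4D encoder is drawn toward the VLM's feature space.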

Abstract

Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLMs) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLMs into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Model Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining it with the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at https://github.com/Shark0-0/VG4D.

Paper Structure

This paper contains 11 sections, 8 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: (a) General pipeline in existing methods: the input point cloud video is processed by a 4D encoder and then a standard classifier to generate prediction scores. (b) Our proposed method harnesses the knowledge of a Vision-Language pre-trained model to enhance action recognition performance. (c) Hard classification cases for the point cloud and RGB modalities.
  • Figure 2: Overall architecture of our framework. (a) VG4D (VLM goes 4D). We use a cross-modal contrastive learning objective to train our proposed 4D encoder: im-PSTNet. The knowledge of the VLM is transferred to the 4D encoder by aligning the 4D representation with language and RGB, respectively. During testing, an ensemble approach is used to integrate multiple scores. (b) The overall framework of our proposed im-PSTNet. It consists of a spatial feature extractor and a spatio-temporal feature extractor.
  • Figure 3: Action classification cases across different modalities.
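The test-time ensemble mentioned in the Figure 2 caption could look roughly like the sketch below. The source says only that multiple scores are integrated, so everything here is an assumption: each source of scores is taken to produce one list of class logits, the fusion rule is a weighted sum of softmax probabilities, and the names `softmax` and `ensemble_predict` are hypothetical. The paper's actual fusion rule and weights may differ.

```python
import math


def softmax(logits):
    """Convert class logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(score_sets, weights=None):
    """Fuse per-source class logits by a weighted sum of softmax probabilities.

    score_sets: one list of class logits per score source (assumed layout).
    weights: optional per-source weights; equal weights are a placeholder.
    Returns (predicted_class_index, fused_scores).
    """
    if weights is None:
        weights = [1.0] * len(score_sets)
    num_classes = len(score_sets[0])
    fused = [0.0] * num_classes
    for weight, logits in zip(weights, score_sets):
        for c, prob in enumerate(softmax(logits)):
            fused[c] += weight * prob
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused
```

For example, `ensemble_predict([[2.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])` fuses four hypothetical two-class score vectors and returns the class with the largest summed probability.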