Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos

Zhi Zuo, Chenyi Zhuang, Pan Gao, Jie Qin, Hao Feng, Nicu Sebe

TL;DR

Uni4D tackles the challenge of learning expressive 4D representations from point cloud videos without explicit motion cues. It introduces a self-disentangled MAE that aligns high-level semantics in the latent space while reconstructing geometry in Euclidean space, using two learnable tokens in a shared decoder to disentangle high-level semantics from low-level geometry. The method employs four objectives ($L_{geo}$, $L_{lat}$, $L_{global}$, and $L_{motion}$) and reports gains across MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17, including a +3.8% improvement in HOI4D action segmentation. The findings show that the pre-trained encoder can yield discriminative 4D representations without task-specific motion priors, enabling robust fine-tuning across coarse- and fine-grained 4D tasks and suggesting broad applicability to robotics and vision systems dealing with dynamic 3D data.
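
As a rough illustration of how four objectives such as $L_{geo}$, $L_{lat}$, $L_{global}$, and $L_{motion}$ might be combined during pre-training, here is a minimal sketch. The summary does not give the form of any loss or their weights, so everything concrete here is an assumption: Chamfer distance stands in for the geometry term, a cosine distance stands in for the alignment terms, and the equal weights and the names `chamfer_distance`, `cosine_alignment`, and `total_loss` are hypothetical.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (lists of xyz tuples).

    Assumption: used here only as a stand-in for the geometry objective.
    """
    def one_way(a, b):
        # Mean squared distance from each point in `a` to its nearest point in `b`.
        return sum(min(math.dist(p, q) ** 2 for q in b) for p in a) / len(a)

    return one_way(pred, target) + one_way(target, pred)


def cosine_alignment(u, v):
    """1 - cosine similarity; 0 when both feature vectors point the same way.

    Assumption: a generic alignment loss, not necessarily the paper's.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def total_loss(l_geo, l_lat, l_global, l_motion, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four objectives; equal weights are an assumption."""
    terms = (l_geo, l_lat, l_global, l_motion)
    return sum(w * l for w, l in zip(weights, terms))


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    l_geo = chamfer_distance(pred, target)
    l_lat = cosine_alignment([1.0, 0.0], [1.0, 0.0])
    print(total_loss(l_geo, l_lat, 0.0, 0.0))
```

Keeping each term as a separate scalar before summing makes it easy to log the objectives individually and to ablate one by setting its weight to zero.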

Abstract

Self-supervised representation learning for point cloud videos remains a challenging problem with two key limitations: (1) existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations; (2) prior Masked AutoEncoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. In this work, we propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations. To overcome the first limitation, we learn motion by aligning high-level semantics in the latent space without any explicit knowledge. To tackle the second, we introduce a self-disentangled learning strategy that incorporates the latent token with the geometry token within a shared decoder, effectively disentangling low-level geometry and high-level semantics. In addition to the reconstruction objective, we employ three alignment objectives to enhance temporal understanding, including frame-level motion and video-level global information. We show that our pre-trained encoder yields surprisingly discriminative spatio-temporal representations without further fine-tuning. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 demonstrate the superiority of our approach in both coarse-grained and fine-grained 4D downstream tasks. Notably, Uni4D improves action segmentation accuracy on HOI4D by +3.8%.
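
The self-disentangled decoding idea, two learnable tokens sharing one decoder, can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not say whether the geometry and latent tokens are decoded in separate passes or in one sequence (two passes are used here), the decoder is a toy mixing function rather than a Transformer, and the names `build_decoder_input`, `toy_shared_decoder`, and `disentangled_decode` are hypothetical.

```python
def build_decoder_input(visible, mask, query_token):
    """Place encoder outputs at visible slots and `query_token` at masked slots."""
    visible_iter = iter(visible)
    return [list(query_token) if masked else list(next(visible_iter))
            for masked in mask]


def toy_shared_decoder(tokens):
    """Stand-in for a shared decoder: mixes each token with the sequence mean.

    Assumption: a real model would use Transformer blocks with learned weights.
    """
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def disentangled_decode(visible, mask, geometry_token, latent_token):
    """Run the same decoder twice, once per token type, and keep masked slots.

    The geometry-token outputs would feed a reconstruction loss in Euclidean
    space; the latent-token outputs would feed alignment losses in latent space.
    """
    geo_out = toy_shared_decoder(build_decoder_input(visible, mask, geometry_token))
    lat_out = toy_shared_decoder(build_decoder_input(visible, mask, latent_token))
    geo_masked = [o for o, m in zip(geo_out, mask) if m]
    lat_masked = [o for o, m in zip(lat_out, mask) if m]
    return geo_masked, lat_masked


if __name__ == "__main__":
    visible = [[1.0, 0.0], [0.0, 1.0]]   # encoder outputs for unmasked slots
    mask = [False, True, False, True]    # True marks a masked slot
    geo, lat = disentangled_decode(visible, mask, [1.0, 1.0], [-1.0, -1.0])
    print(geo, lat)
```

Because the decoder weights are shared, only the choice of query token determines whether a masked slot is decoded toward geometry or toward latent semantics, which is the separation the abstract describes.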

Paper Structure

This paper contains 32 sections, 12 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Comparison of motion patterns learned by the pre-trained encoder (without fine-tuning). Uni4D shows consistent activation in the foot region during "kick forward", indicating that it captures long-term temporal dependencies.
  • Figure 2: Overview of the Uni4D framework. Our approach aligns the motion and global semantics in the latent space while reconstructing geometry in Euclidean space. Two learnable tokens are used to disentangle low-level and high-level features during decoding. The pre-trained encoder learns discriminative 4D representations to boost various 4D downstream tasks.
  • Figure 3: Ablation study of global alignment. We report the training and validation losses during fine-tuning.
  • Figure 4: We compare the learned representations of Uni4D and MaST-Pre through attention and t-SNE visualizations. Note that features are obtained from the pre-trained encoder without fine-tuning.
  • Figure 5: Visualization of reconstruction results.
  • ...and 2 more figures