Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos
Zhi Zuo, Chenyi Zhuang, Pan Gao, Jie Qin, Hao Feng, Nicu Sebe
TL;DR
Uni4D tackles the challenge of learning expressive 4D representations from point cloud videos without explicit motion cues. It introduces a self-disentangled MAE that aligns high-level semantics in the latent space while reconstructing geometry in Euclidean space, using two learnable tokens in a shared decoder to disentangle high-level semantics from low-level geometry. The method employs four objectives ($L_{geo}$, $L_{lat}$, $L_{global}$, and $L_{motion}$) and demonstrates strong gains across MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17, including a notable +3.8% improvement in HOI4D action segmentation. The findings show that the pre-trained encoder can yield discriminative 4D representations without task-specific motion priors, enabling robust fine-tuning across coarse- and fine-grained 4D tasks and suggesting broad applicability to robotics and vision systems dealing with dynamic 3D data.
Abstract
Self-supervised representation learning for point cloud videos remains a challenging problem with two key limitations: (1) existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations; (2) prior Masked AutoEncoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. In this work, we propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations. To overcome the first limitation, we learn motion by aligning high-level semantics in the latent space \textit{without any explicit knowledge}. To tackle the second, we introduce a \textit{self-disentangled learning} strategy that combines a latent token with a geometry token within a shared decoder, effectively disentangling low-level geometry and high-level semantics. In addition to the reconstruction objective, we employ three alignment objectives to enhance temporal understanding, covering frame-level motion and video-level global information. We show that our pre-trained encoder yields surprisingly discriminative spatio-temporal representations even without further fine-tuning. Extensive experiments on MSR-Action3D, NTU-RGBD, HOI4D, NvGesture, and SHREC'17 demonstrate the superiority of our approach in both coarse-grained and fine-grained 4D downstream tasks. Notably, Uni4D improves action segmentation accuracy on HOI4D by $+3.8\%$.
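
As a rough illustration of how a geometric reconstruction objective and latent-space alignment objectives could be combined into a single pre-training loss, the sketch below may help. The abstract does not state the form of any objective, so everything here is an assumption: a symmetric Chamfer distance stands in for the geometry term, a cosine distance stands in for the alignment terms, and the function names (`chamfer_distance`, `cosine_distance`, `total_loss`) and the equal default weights are hypothetical, not taken from the paper.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets (lists of (x, y, z)).

    Assumed stand-in for the geometry reconstruction term L_geo.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest destination point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors.

    Assumed stand-in for an alignment term such as L_lat, L_global, or L_motion.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def total_loss(l_geo, l_lat, l_global, l_motion, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four objectives; the weights are hypothetical."""
    terms = (l_geo, l_lat, l_global, l_motion)
    return sum(w * t for w, t in zip(weights, terms))
```

In this sketch, the geometry term would compare reconstructed and ground-truth points of masked regions, while each alignment term would compare a predicted feature vector with its target feature vector; the four scalars are then passed to `total_loss`.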
