GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation

Jiayi Tian, Jiaze Wang

Abstract

Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, and they neglect these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed \textbf{Gaussian Aware Temporal Scaling (GATS)}, which explicitly resolves both distributional inconsistencies and temporal scale bias. The proposed \emph{Uncertainty Guided Gaussian Convolution (UGGC)} incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the \emph{Temporal Scaling Attention (TSA)} introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (\textbf{+6.62\%} accuracy), NTU RGBD (\textbf{+1.4\%} accuracy), and Synthia4D (\textbf{+1.8\%} mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.
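
The role of the temporal scaling factor can be illustrated with a toy example. The sketch below is not the paper's TSA module: the function names are hypothetical, and the scale `s` is set by hand to the frame interval, whereas the abstract describes it as learnable. It only shows why rescaling temporal distances makes a velocity estimate independent of the frame rate.

```python
def scaled_temporal_distance(frame_i, frame_j, s):
    """Temporal distance between two frame indices, rescaled by s."""
    return (frame_j - frame_i) * s


def relative_velocity(x_i, x_j, frame_i, frame_j, s):
    """Displacement divided by the rescaled temporal distance."""
    dt = scaled_temporal_distance(frame_i, frame_j, s)
    return [(b - a) / dt for a, b in zip(x_i, x_j)]


def sample_track(fps, num_frames, speed=(2.0, 0.0, 0.0)):
    """Positions of a point moving at constant speed, sampled at `fps`."""
    return [[v * k / fps for v in speed] for k in range(num_frames)]


# The same motion observed under two frame rate partitions.
track_30 = sample_track(fps=30, num_frames=4)
track_10 = sample_track(fps=10, num_frames=4)

# Unscaled (s = 1): the adjacent-frame velocity depends on the frame rate.
raw_30 = relative_velocity(track_30[0], track_30[1], 0, 1, s=1.0)
raw_10 = relative_velocity(track_10[0], track_10[1], 0, 1, s=1.0)

# Scaled: s is hand-set to the frame interval (an assumption; learned in GATS).
vel_30 = relative_velocity(track_30[0], track_30[1], 0, 1, s=1 / 30)
vel_10 = relative_velocity(track_10[0], track_10[1], 0, 1, s=1 / 10)

print("unscaled:", raw_30[0], raw_10[0])
print("scaled:  ", vel_30[0], vel_10[0])
```

With `s = 1` the two estimates differ by the ratio of the frame rates; once the temporal distance is rescaled, both recover the same underlying speed.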

Paper Structure (45 sections, 20 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Motivation. Right: Video sequences under different frame rate partitions. With a large temporal interval, some moving objects (second row) and static objects (third column) disappear, whereas they remain visible with a small temporal interval. This indicates that varying frame rate partitions can cause certain velocity features to vanish. Left: GATS adaptively adjusts the scaling distribution across different frame rate partitions, mitigating the relative velocity bias introduced by frame rate variations and reducing fluctuations in accuracy. Consequently, GATS improves accuracy by 6.62% and 3.83% over P4D and PST, respectively.
  • Figure 2: Pipeline. The overall network backbone consists of two core modules: (a) UGGC Module. After the point cloud is fed into the network, the spatial variations of $x_i^t$ generate cross frame representations. However, different cross frames often lead to inter frame inconsistencies. The UGGC module extracts local Gaussian features and incorporates an uncertainty aware gating mechanism to jointly model geometric and Gaussian local features of 4D point clouds, thereby enhancing the robustness of feature extraction. (b) TSA Module. Under different frame rates, the estimate of the relative velocity $v_i^t$ varies, and as the temporal dimension progresses, motion features tend to produce inter frame inconsistencies. To address this, the TSA module introduces a learnable scaling factor $s$ to normalize temporal distances, achieving frame partition invariance and ensuring consistent relative velocity estimation across varying frame rates.
  • Figure 3: MSR-attention Results. Attention visualization showing focus on key regions and spatio-temporal dynamics.
  • Figure 4: 4D Qualitative Results. The rows from top to bottom correspond to the input, GT, and our predictions. Detailed comparative results are highlighted in the enlarged regions of the figure.
  • Figure 5: Analysis of accuracy with varying temporal and spatial parameters.
  • ...and 2 more figures
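
The UGGC description in the Figure 2 caption (local Gaussian statistics plus uncertainty aware gating) can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the authors' formulation: the diagonal Gaussian, the likelihood weighting, the gate `1 / (1 + total variance)`, and all function names are hypothetical.

```python
import math


def gaussian_stats(points):
    """Per-axis mean and variance of a neighborhood (diagonal Gaussian)."""
    n = len(points)
    dims = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dims)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dims)]
    return mean, var


def uncertainty_gate(var):
    """Hypothetical gate in (0, 1]: shrinks as the neighborhood spreads out."""
    return 1.0 / (1.0 + sum(var))


def gated_gaussian_aggregate(points, feats, eps=1e-6):
    """Weight neighbor features by Gaussian likelihood, then gate the result."""
    mean, var = gaussian_stats(points)
    weights = []
    for p in points:
        # Squared Mahalanobis distance under the diagonal Gaussian.
        m2 = sum((p[d] - mean[d]) ** 2 / (var[d] + eps) for d in range(len(p)))
        weights.append(math.exp(-0.5 * m2))
    total = sum(weights)
    channels = len(feats[0])
    pooled = [
        sum(w * f[c] for w, f in zip(weights, feats)) / total
        for c in range(channels)
    ]
    gate = uncertainty_gate(var)
    return [gate * v for v in pooled], gate


# A compact neighborhood and the same shape spread out ten times wider.
tight = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
noisy = [[10 * c for c in p] for p in tight]
feats = [[1.0], [1.0], [1.0], [1.0]]

out_tight, gate_tight = gated_gaussian_aggregate(tight, feats)
out_noisy, gate_noisy = gated_gaussian_aggregate(noisy, feats)
print("gate (tight):", gate_tight, "gate (noisy):", gate_noisy)
```

In this toy setup the widely spread neighborhood receives a smaller gate, so its aggregated feature contributes less, which is the qualitative behavior the caption attributes to uncertainty aware gating.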