PRENet: A Plane-Fit Redundancy Encoding Point Cloud Sequence Network for Real-Time 3D Action Recognition

Shenglin He, Xiaoyang Qu, Jiguang Wan, Guokuan Li, Changsheng Xie, Jianzong Wang

TL;DR

This work tackles the challenge of real-time 3D action recognition from point cloud sequences by eliminating both spatial and temporal redundancies. The authors introduce PRENet, a dual-module framework consisting of Plane-Fit Embedding (PFE) to compress spatial/temporal redundancy within redundancy windows and Spatio-Temporal Consistency Encoding (STCE) to preserve alignment between spatial and temporal features. Through redundancy-windowing and plane propagation, PRENet achieves near-state-of-the-art accuracy with roughly fourfold faster inference on large-scale datasets like NTU RGB+D 60/120, and shows strong performance on MSR-Action 3D as well. The approach significantly improves practicality for real-time deployment in industrial and robotics contexts by reducing computational cost while maintaining discriminative spatio-temporal representations, as demonstrated by extensive ablations and dataset evaluations.

Abstract

Recognizing human actions from point cloud sequences has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in a large number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The core idea of our approach is to use plane fitting to mitigate spatial redundancy within the sequence while encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing spatially encoded data to be reused for temporal stream encoding. The Spatio-Temporal Consistency Encoding module combines the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We conducted extensive experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
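
As a rough illustration of the plane-fitting idea described in the abstract, the sketch below fits a least-squares plane to one local region of a point cloud and measures the mean point-to-plane distance. It is only a minimal NumPy sketch under assumptions: the function names `fit_plane` and `distance_error`, the SVD-based fit, and the synthetic patch are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an (N, 3) array; returns (centroid, unit normal)."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, which serves as the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]

def distance_error(points, centroid, normal):
    """Mean absolute point-to-plane distance (an assumed error measure)."""
    return float(np.abs((points - centroid) @ normal).mean())

# Synthetic local region: a tilted patch z = 0.5 * x with small noise.
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
z = 0.5 * xy[:, 0] + 0.01 * rng.standard_normal(200)
patch = np.column_stack([xy, z])

centroid, normal = fit_plane(patch)
err = distance_error(patch, centroid, normal)
print("normal:", normal, "mean distance error:", err)
```

A region that is well described by a plane can then be summarized by a centroid and a normal (six numbers) instead of hundreds of raw points, which is the kind of spatial redundancy the abstract refers to.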

Paper Structure (15 sections, 4 equations, 5 figures, 4 tables)

This paper contains 15 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) In human action point clouds, the normal vector of each point remains approximately parallel to those of its adjacent points. (b) We employ two distinct methods, voxelization and plane fitting, to represent the raw point cloud data, and we use the distance error to assess the quantization discrepancies introduced by each representation.
  • Figure 2: The overall architecture of our PRENet. PRENet consists of a PFE module and an STCE module. The PFE module eliminates spatial and temporal redundancy within redundancy windows, while the STCE module combines the non-redundant vectors in each window with their respective temporal structure.
  • Figure 3: The architecture of the Plane-Fit Embedding module (redundancy window size = 3). (a) We select a key frame in each redundancy window and construct several local regions in the key frame. For each local region, we fit a plane and then propagate these planes to the other frames in the window. (b) We use a Spatio-Temporal Consistency Encoding module to combine the spatial features of the successfully propagated planes with their temporal structure and obtain their features through a max pooling operation. (c) For planes that fail to propagate, we fit new planes in the corresponding local regions of the other frames. (d) In each frame, some points cannot be fitted to any plane. We use PointNet to extract features of these points and then use the STCE module to fuse their temporal information.
  • Figure 4: The Spatio-Temporal Consistency Encoding (STCE) module (b) is built mainly from STCE Layers (a). Each STCE Layer is primarily composed of two shared MLPs. The STCE module stacks three STCE Layers and incorporates skip connections to extract features at different scales.
  • Figure 5: (a) The impact of different parameter settings on the recognition speed. (b) The impact of different parameter settings on the recognition accuracy.
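
The plane propagation step described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration only: the captions do not state how a propagation is judged successful, so the radius-based region lookup, the mean-distance threshold `tol`, and the names `propagate_plane`, `radius`, and `tol` are all assumptions made for this example.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an (N, 3) array; returns (centroid, unit normal)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]

def propagate_plane(key_region, other_frames, radius=0.5, tol=0.02):
    """Fit a plane in the key frame and test it against the other frames.

    Assumed criterion: propagation succeeds when the points of another frame
    lying within `radius` of the key-frame centroid have a mean point-to-plane
    distance below `tol`.
    """
    centroid, normal = fit_plane(key_region)
    success = []
    for frame in other_frames:
        near = frame[np.linalg.norm(frame - centroid, axis=1) < radius]
        if len(near) < 3:
            success.append(False)  # too few points to support the plane
            continue
        err = np.abs((near - centroid) @ normal).mean()
        success.append(bool(err < tol))
    return (centroid, normal), success

# Redundancy window of size 3: one key frame and two other frames.
rng = np.random.default_rng(0)
xy = rng.uniform(-0.3, 0.3, size=(100, 2))
key = np.column_stack([xy, np.zeros(100)])           # flat patch at z = 0
still = key + 0.002 * rng.standard_normal(key.shape)  # nearly unchanged surface
moved = key + np.array([0.0, 0.0, 0.2])               # surface shifted along z

plane, ok = propagate_plane(key, [still, moved])
print(ok)
```

In this toy window the plane carries over to the nearly unchanged frame, so its encoding could be reused, while the shifted frame fails the check and would need a newly fitted plane, mirroring cases (b) and (c) of the caption.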