Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer

Shihao Zou, Yuxuan Mu, Wei Ji, Zi-An Wang, Xinxin Zuo, Sen Wang, Weixin Si, Li Cheng

TL;DR

The paper addresses efficient 3D human pose tracking using exclusively event cameras by introducing the first sparse SNN-based framework. It builds a Spiking Spatiotemporal Transformer that enables bi-directional fusion of spike features and adopts a Normalized Hamming similarity to measure attention between binary spikes, addressing challenges of early-time temporal information and ill-defined similarity. A large synthetic dataset, SynEventHPD, totaling $45.72$ hours, supports robust training and generalization, and experiments show the method achieves strong accuracy with only $19.1\%$ of the FLOPs and $3.6\%$ of the energy cost of state-of-the-art ANN methods, while outperforming existing SNN baselines. The work advances energy-efficient, real-time event-based pose tracking and provides code and data to enable reproducibility and further research.
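To make the similarity issue concrete, here is a minimal NumPy sketch of a normalized Hamming similarity between binary spike vectors. It uses the textbook definition (one minus the fraction of mismatched positions), which is an assumption and may differ in detail from the paper's attention score. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def normalized_hamming_similarity(q_spikes, k_spikes):
    """Pairwise similarity in [0, 1] between rows of two binary matrices.

    q_spikes: (N, C) array of 0/1 values; k_spikes: (M, C) array of 0/1 values.
    Returns an (N, M) array holding 1 - (Hamming distance / C).
    """
    q = np.asarray(q_spikes, dtype=np.int64)
    k = np.asarray(k_spikes, dtype=np.int64)
    c = q.shape[1]
    # Count positions where the vectors disagree: (q=1, k=0) plus (q=0, k=1).
    mismatches = q @ (1 - k).T + (1 - q) @ k.T
    return 1.0 - mismatches / c

# Toy spike vectors (illustrative values only).
q = np.array([[1, 0, 1, 0],
              [0, 0, 0, 0]])
k = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 0, 0]])

sim = normalized_hamming_similarity(q, k)
dot = q @ k.T  # dot-product score on the same spikes, for comparison
```

In this toy case the two all-zero vectors are identical, yet their dot product is 0, the same score the dot product gives to the fully mismatched pair. The Hamming-based score rates the identical pair at 1 and the mismatched pair at 0, which is one way to read the "ill-defined similarity" concern.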

Abstract

The event camera, an asynchronous vision sensor that captures scene dynamics, presents new opportunities for highly efficient 3D human pose tracking. Existing approaches typically adopt modern Artificial Neural Networks (ANNs), such as CNNs or Transformers, in which sparse events are converted into dense images or paired with additional gray-scale images as input. Such practices, however, ignore the inherent sparsity of events, resulting in redundant computation, increased energy consumption, and potentially degraded performance. Motivated by these observations, we introduce the first sparse Spiking Neural Network (SNN) framework for 3D human pose tracking based solely on events. Our approach eliminates the need to convert sparse data to dense formats or to incorporate additional images, thereby fully exploiting the innate sparsity of input events. Central to our framework is a novel Spiking Spatiotemporal Transformer, which enables bi-directional spatiotemporal fusion of spike pose features and provides a guaranteed similarity measurement between binary spike features in spiking attention. Moreover, we have constructed a large-scale synthetic dataset, SynEventHPD, that features a broad and diverse set of 3D human motions and many more hours of event streams. Empirical experiments demonstrate the superiority of our approach over existing state-of-the-art (SOTA) ANN-based methods while requiring only 19.1% of their FLOPs and 3.6% of their energy cost. Furthermore, our approach outperforms existing SNN-based benchmarks on this task, highlighting the effectiveness of our proposed SNN framework. The dataset will be released upon acceptance, and code can be found at https://github.com/JimmyZou/HumanPoseTracking_SNN.

Paper Structure (14 sections, 1 theorem, 19 equations, 10 figures, 5 tables)

Key Result

Proposition 1

Define $\mathbf{q}_i, \mathbf{k}_j \in\mathbb{R}^{d_k}$ as the real-valued query and key. The corresponding binary embeddings $\mathbf{s}^q_i, \mathbf{s}^k_j \in \{0, 1\}^{C_k}$ are obtained as where $\mathbf{A}\in \mathbb{R}^{C_k \times d_k}$ is a projection matrix with each element generated independently from the normal distribution $\mathcal{N}(0, 1)$. Given that $\delta > 0$ and $C_k > \frac

Figures (10)

  • Figure 1: (a) Spiking neuron model. A LIF spiking neuron maintains a membrane potential and updates it when receiving spike trains from its connected neurons. The neuron generates an output spike when its potential exceeds a threshold and then resets the potential. (b) Feedforward in SNNs. This process includes potential leaking and charging, neuron spiking, and potential resetting. Feed-forwarding typically rolls over time and propagates from layer $l-1$ to layer $l$. (c) Backpropagation Through Time in SNNs. The gradients are normally computed through time and then back-propagated from layer $l$ to layer $l-1$.
  • Figure 2: Pipeline of our method. A novel Spiking Spatiotemporal Transformer is proposed to fuse spike pose features spatiotemporally, addressing one-directional temporal dependency in SNNs.
  • Figure 3: Spiking Spatiotemporal Transformer. Spiking Spatiotemporal Attention enables bi-directional flow of space-time information in SNNs. Attention Score of normalized Hamming similarity provides more accurate similarity measurement between binary spike tensors, in contrast to dot-product similarity utilized in prior works [zhou2022spikformer, zhou2023spikingformer].
  • Figure 4: Human Poses Regression. The input spike tensor first undergoes 2D average pooling and then goes through three linear layers in parallel to regress the global translation $\mathbf{d}$ and SMPL pose and shape parameters $\boldsymbol{\theta}, \boldsymbol{\beta}$ over all $T$ time steps.
  • Figure 5: t-SNE Visualization of Poses from each sub-dataset in our SynEventHPD dataset. MMHPSD [gehrig2020video] only covers a small area, while our SynEventHPD dataset, by including 4 sub-datasets (EventH36M, EventAMASS, EventPHSPD and SynMMHPSD), contains a wide range of poses. This highlights the rich variety of poses provided in SynEventHPD.
  • ...and 5 more figures
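
The leak, charge, spike, and reset cycle described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a generic discrete-time LIF update and not necessarily the exact neuron model used in the paper. The function name, the time constant `tau`, the threshold, and the hard reset to `v_reset` are all assumptions chosen for illustration.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Run a layer of LIF neurons over time.

    inputs: (T, N) array of input currents for T time steps and N neurons.
    Returns a (T, N) array of 0/1 output spikes.
    All parameter values are illustrative assumptions.
    """
    inputs = np.asarray(inputs, dtype=np.float64)
    v = np.full(inputs.shape[1], v_reset, dtype=np.float64)
    spikes = np.zeros_like(inputs)
    for t in range(inputs.shape[0]):
        # Leak toward the reset potential and charge with the current input.
        v = v + (inputs[t] - (v - v_reset)) / tau
        # Fire wherever the potential reaches or exceeds the threshold.
        fired = v >= v_threshold
        spikes[t] = fired
        # Hard reset for the neurons that fired; others keep their potential.
        v = np.where(fired, v_reset, v)
    return spikes

# Neuron 0 gets a constant drive of 1.5, neuron 1 gets no input.
out = lif_forward(np.array([[1.5, 0.0]] * 4))
```

With these assumed settings, the driven neuron needs two steps to charge past the threshold, so it fires on every second step, while the undriven neuron stays silent. Because the potential at step t depends only on earlier steps, information flows forward in time only, which is the one-directional temporal dependency that the Figure 2 caption refers to.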

Theorems & Definitions (1)

  • Proposition 1: Johnson–Lindenstrauss Lemma on Binary Embedding [jacques2013robust, yi2015binary]
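
As a rough numerical illustration of the idea behind Proposition 1, the sketch below binarizes two real-valued vectors with a Gaussian random projection and compares their normalized Hamming distance to the angle between the originals. The embedding equation is missing from the text above, so the thresholding rule used here (a spike wherever the projection is positive) is an assumption, as are the sizes and the random seed. It relies on the standard random-hyperplane fact that two vectors fall on opposite sides of a Gaussian hyperplane with probability equal to their angle divided by $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, C_k = 16, 4096  # hypothetical input and embedding dimensions

# Projection matrix with i.i.d. N(0, 1) entries, as in the proposition.
A = rng.standard_normal((C_k, d_k))

def binary_embed(x, proj):
    """Assumed embedding: 1 where the projection is positive, else 0."""
    return (proj @ x > 0).astype(np.int64)

q = rng.standard_normal(d_k)
k = rng.standard_normal(d_k)
s_q = binary_embed(q, A)
s_k = binary_embed(k, A)

# Fraction of positions where the two binary codes differ.
hamming = np.mean(s_q != s_k)
# Angle between the real-valued vectors, normalized to [0, 1].
cosine = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
angle = np.arccos(cosine) / np.pi
```

Under these assumptions, `hamming` and `angle` should agree closely when `C_k` is large, and the gap generally shrinks as `C_k` grows. This suggests why a Hamming-based score on binary spikes can stand in for a similarity between the underlying real-valued query and key.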