
HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating

Weidong Hao

TL;DR

This work targets efficient event-based action recognition (EAR) by addressing computational and structural redundancies in dense voxel representations and exploiting global spectral information. It introduces Compact Holographic Spatiotemporal Representation (CHSR), which embeds horizontal spatial cues into the Time-Height ($T$-$H$) view to realize multi-view perception within a single 2D representation, and Global Spectral Gating (GSG), which uses FFT-based global token mixing to capture long-range motion patterns with minimal parameter overhead. The approach is instantiated as HoloEv-Net with two variants: a high-performance Base and a highly efficient Small, achieving state-of-the-art accuracy on THU-EACT-50-CHL, HARDVS, and DailyDVS-200, while offering dramatic reductions in parameters, FLOPs, and latency for edge deployment. Key contributions include the CHSR formulation, the GSG module with learnable spectral weights and gated reconstruction, and comprehensive ablations demonstrating the benefits of holographic embedding and frequency-domain modeling. The work’s practical impact lies in enabling real-time, energy-efficient EAR on resource-constrained platforms without sacrificing accuracy, and it lays groundwork for extending holographic and spectral techniques to other event-based vision tasks.
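To make the CHSR idea concrete, here is a minimal sketch of a three-channel Time-Height encoding in the spirit of the Figure 3 caption: two polarity-specific density maps plus one map carrying horizontal position. The function name `chsr_sketch`, the binning scheme, and in particular the choice of the mean normalized x-coordinate as the "holographic" channel are assumptions made for illustration; this summary does not give the paper's actual formulation.

```python
import numpy as np

def chsr_sketch(x, y, t, p, width, height, t_bins):
    """Hypothetical CHSR-style encoding; returns a (3, t_bins, height) array.

    Channels 0 and 1: counts of positive / negative events per (time bin, row).
    Channel 2: mean normalized x-coordinate of the events in each cell
    (an assumed stand-in for the holographic map, not the paper's formula).
    Assumes at least one event and integer row indices in [0, height).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p)

    # Map timestamps to integer bins in [0, t_bins - 1].
    span = max(t.max() - t.min(), 1e-9)
    tb = np.minimum(((t - t.min()) / span * t_bins).astype(np.int64), t_bins - 1)

    rep = np.zeros((3, t_bins, height), dtype=np.float64)
    pos = p > 0
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(rep[0], (tb[pos], y[pos]), 1.0)
    np.add.at(rep[1], (tb[~pos], y[~pos]), 1.0)

    # Average horizontal position per cell, scaled to [0, 1].
    x_sum = np.zeros((t_bins, height), dtype=np.float64)
    np.add.at(x_sum, (tb, y), x / max(width - 1, 1))
    count = rep[0] + rep[1]
    rep[2] = np.divide(x_sum, count, out=np.zeros_like(x_sum), where=count > 0)
    return rep
```

The output is a single 2D image with three channels, so it can be fed to an ordinary 2D backbone instead of a voxel grid or a multi-branch network, which is the efficiency argument made above.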

Abstract

Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS and DailyDVS-200, outperforming existing methods by 10.29%, 1.71% and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4×, FLOPs by 300×, and latency by 2.4× compared to heavy baselines, demonstrating its potential for edge deployment.
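The FFT-based token mixing described for GSG can be sketched as a NumPy forward pass. This is a generic frequency-domain filter with a sigmoid gate and a residual connection, written under stated assumptions: the name `spectral_gate`, the parameter shapes, and the specific gating form are hypothetical and are not taken from the paper.

```python
import numpy as np

def spectral_gate(tokens, spec_weight, gate_w, gate_b):
    """Forward pass of a hypothetical FFT-based gating block.

    tokens:      (n_tokens, channels) real-valued features
    spec_weight: (n_tokens // 2 + 1, channels) complex per-frequency weights
    gate_w:      (channels, channels) gate projection
    gate_b:      (channels,) gate bias
    """
    n = tokens.shape[0]
    # Global token mixing: every frequency bin depends on all tokens,
    # so an elementwise product in this domain mixes the whole sequence.
    spectrum = np.fft.rfft(tokens, axis=0)
    mixed = np.fft.irfft(spectrum * spec_weight, n=n, axis=0)
    # Assumed gating form: a sigmoid of the input decides how much of the
    # spectrally filtered signal is added back to the residual path.
    gate = 1.0 / (1.0 + np.exp(-(tokens @ gate_w + gate_b)))
    return tokens + gate * mixed
```

In this sketch the spectral weights contribute only `(n_tokens // 2 + 1) * channels` complex parameters, which is one way to read the "negligible parameter overhead" claim; the mixing itself costs O(N log N) in the number of tokens rather than the O(N²) of pairwise attention.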

Paper Structure (27 sections, 5 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Performance vs. Cost on DailyDVS-200 [wang2024dailydvs]. Bubble size denotes model parameters. Our HoloEv-Net (red) outperforms existing methods, demonstrating an optimal balance of accuracy and efficiency.
  • Figure 2: Overview of the proposed HoloEv-Net. The raw event stream is first processed by our Compact Holographic Spatiotemporal Representation (CHSR). The extracted features from the backbone are processed by the Global Spectral Gating (GSG) module.
  • Figure 3: Visualization of the proposed CHSR. Unlike standard multi-view projections (middle), which suffer from information loss, our CHSR constructs a comprehensive representation in the $T$-$H$ domain. It consists of three channels: (1) Density Map (+) and (2) Density Map (-), which accumulate polarity-specific events to record motion trajectories; and (3) Holographic Map, which recovers the horizontal spatial cues (different colors denote different horizontal positions) typically lost in $T$-$H$ projections.
  • Figure 4: Frequency analysis of event streams. The figure displays the event rate and FFT spectrum for two distinct actions. The distinct main frequencies (3.21 Hz for running, 1.32 Hz for nodding) highlight the discriminability of actions in the frequency domain.
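
The kind of analysis summarized in the Figure 4 caption (event rate, FFT spectrum, main frequency) can be reproduced roughly as follows. This is a sketch under stated assumptions: the function name `dominant_frequency`, the 10 ms bin width, and the choice to remove the mean before the transform are illustrative and not taken from the paper.

```python
import numpy as np

def dominant_frequency(timestamps_s, bin_s=0.01):
    """Return the strongest non-DC frequency (Hz) of the event-rate signal.

    timestamps_s: event timestamps in seconds (any order).
    bin_s:        width of the event-count bins in seconds (assumed value).
    Assumes the stream spans at least two bins.
    """
    t = np.asarray(timestamps_s, dtype=np.float64)
    # Event rate: number of events falling into each fixed-width time bin.
    n_bins = int(np.ceil((t.max() - t.min()) / bin_s)) + 1
    idx = ((t - t.min()) / bin_s).astype(np.int64)
    rate = np.bincount(idx, minlength=n_bins).astype(np.float64)
    rate -= rate.mean()  # remove the DC component before the transform
    power = np.abs(np.fft.rfft(rate)) ** 2
    freqs = np.fft.rfftfreq(rate.size, d=bin_s)
    # Skip bin 0 so the result is always a non-zero frequency.
    return freqs[np.argmax(power[1:]) + 1]
```

The frequency resolution of this estimate is roughly the reciprocal of the recording length, so resolving peaks to about 0.01 Hz, as in the figure, would need long recordings, zero-padding, or peak interpolation.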