HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating
Weidong Hao
TL;DR
This work targets efficient event-based action recognition (EAR) by removing the computational and structural redundancy of dense voxel representations and multi-branch architectures, and by exploiting global spectral information. It introduces the Compact Holographic Spatiotemporal Representation (CHSR), which embeds horizontal spatial cues into the Time-Height ($T$-$H$) view to realize multi-view perception within a single 2D representation, and Global Spectral Gating (GSG), which uses FFT-based global token mixing to capture long-range motion patterns with minimal parameter overhead. The approach is instantiated as HoloEv-Net in two variants. The high-performance Base variant achieves state-of-the-art accuracy on THU-EACT-50-CHL, HARDVS, and DailyDVS-200, while the efficient Small variant reduces parameters by 5.4×, FLOPs by 300×, and latency by 2.4× relative to heavy baselines. Key contributions include the CHSR formulation, the GSG module with learnable spectral weights and gated reconstruction, and comprehensive ablations demonstrating the benefits of holographic embedding and frequency-domain modeling. The work’s practical impact lies in enabling energy-efficient EAR on resource-constrained platforms while retaining highly competitive accuracy, and it lays groundwork for extending holographic and spectral techniques to other event-based vision tasks.
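To make the GSG idea above concrete, the following is a minimal NumPy sketch of FFT-based global token mixing with learnable spectral weights and a gated reconstruction. It is an illustration under stated assumptions, not the paper's implementation: the function name `global_spectral_gating`, the parameter names (`w_real`, `w_imag`, `gate_w`, `gate_b`), the sigmoid gate, and the residual form are all hypothetical choices, since this summary does not specify the exact layer design.

```python
import numpy as np

def global_spectral_gating(x, w_real, w_imag, gate_w, gate_b):
    """Hypothetical GSG-style block (assumed design, not the paper's exact layer).

    x       : (H, W, C) real feature map
    w_real,
    w_imag  : (H, W // 2 + 1, C) learnable spectral weights (real / imaginary parts)
    gate_w  : (C, C) weights of a per-position linear gate (assumption)
    gate_b  : (C,) bias of that gate (assumption)
    """
    H, W, _ = x.shape
    # 2D real FFT over the two spatial axes: every frequency bin depends on
    # every spatial position, which is what makes the mixing global.
    spectrum = np.fft.rfft2(x, axes=(0, 1))
    # Element-wise filtering with learnable complex weights.
    spectrum = spectrum * (w_real + 1j * w_imag)
    # Back to the spatial domain at the original resolution.
    mixed = np.fft.irfft2(spectrum, s=(H, W), axes=(0, 1))
    # Sigmoid gate computed from the input decides how much of the
    # spectrally mixed signal is added back (assumed gated residual).
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))
    return x + gate * mixed
```

In this sketch the only parameters beyond the small gate are the spectral weights, one complex value per retained frequency bin and channel, and the mixing itself costs O(HW log HW) per channel through the FFT. Keeping only the zero-frequency weight, for instance, reduces the mixed branch to the per-channel spatial mean, the simplest possible global context.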
Abstract
Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS and DailyDVS-200, outperforming existing methods by 10.29%, 1.71% and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4×, FLOPs by 300×, and latency by 2.4× compared to heavy baselines, demonstrating its potential for edge deployment.
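As a rough illustration of how horizontal cues might be folded into a T-H view, the sketch below projects events onto a time-by-height grid and stores, next to the event count, the mean normalized x coordinate of the events in each cell. This is only one plausible reading: the abstract does not give the CHSR formulation, so the function name `th_view_with_x_embedding`, the two-channel layout, and the mean-x encoding are assumptions made for illustration.

```python
import numpy as np

def th_view_with_x_embedding(events, height, width, t_bins):
    """Hypothetical T-H projection carrying a horizontal cue (not the paper's CHSR).

    events : (N, 4) array with columns x, y, t, polarity (polarity unused here)
    Returns a (2, t_bins, height) array:
      channel 0 = event count per (time bin, row)
      channel 1 = mean x coordinate per cell, normalized to [0, 1]
    Assumes width > 1 and integer-valued pixel coordinates.
    """
    x, y, t, _ = events.T
    # Map timestamps to integer bins in [0, t_bins - 1].
    t_span = max(t.max() - t.min(), 1e-9)
    t_idx = np.minimum(((t - t.min()) / t_span * t_bins).astype(int), t_bins - 1)
    y_idx = y.astype(int)

    count = np.zeros((t_bins, height))
    x_sum = np.zeros((t_bins, height))
    # Unbuffered accumulation so repeated (t, y) indices add up correctly.
    np.add.at(count, (t_idx, y_idx), 1.0)
    np.add.at(x_sum, (t_idx, y_idx), x / (width - 1))

    # Mean horizontal position per cell; empty cells stay at zero.
    mean_x = np.divide(x_sum, count, out=np.zeros_like(x_sum), where=count > 0)
    return np.stack([count, mean_x], axis=0)
```

The result is a 2D image with a fixed number of channels, so its size scales with `t_bins * height` rather than with the `t_bins * height * width` cells of a dense voxel grid, which is the kind of saving the abstract attributes to replacing voxel representations.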
