Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

Paul Kielty, Timothy Hanley, Peter Corcoran

TL;DR

This paper introduces Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics.

Abstract

Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Visualization of the LADS process described in Section \ref{sec:measures}. The input window of events, $\boldsymbol{\mathcal{E}}_k$, is accumulated into histogram $H_k$. The subsequent annotated steps are: (a) divide the histogram into a user-specified number of patches; (b) measure the signal dynamics of each patch according to the selected LADS variant; (c) compute a decay value for each patch, then interpolate these values to generate the per-pixel decay map $d_k$; (d) decay the previous surface, $S_{k-1}$, by element-wise multiplication with $d_k$. The resulting product is added to $H_k$ to construct the new surface, $S_k$.
  • Figure 2: Sample representations from an event video showing a blink on an otherwise still face. In the FFT examples, grid lines mark patch boundaries and the heatmap shows the decay value assigned to each patch (before interpolation). With fewer patches, the recursive approach reduces computation while maintaining precise localization of the blink motion, which helps prevent nearby stationary features, such as the eyebrows, from receiving elevated decay values.
  • Figure 3: Each row shows a different stage of an event video: (a) during rapid head motion, (b) minimal head motion sustained for 0.5 s, and (c) minimal head motion sustained for over 2 s. Columns, left to right, are: histogram representation; time-surface with standard global LI; and time-surfaces generated by the three proposed adaptive integration methods (ER, LoG, FFT).
  • Figure 4: Examples of exclusions from the FES dataset with correct elements in green and incorrect elements in red: (a) Inconsistent landmark indexing. (b) Correct bounding box but incorrect landmark positions. (c) Incorrect bounding box but correct landmark positions.
  • Figure 5: LADS-ER representation of an event window featuring fast head motion from the FES and Blink datasets at 30 Hz and 240 Hz.
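
The update outlined in the Figure 1 caption, $S_k = d_k \odot S_{k-1} + H_k$, can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear mapping from patch activity to a multiplier in `[d_min, d_max]`, the `cutoff` frequency, the use of absolute event counts, and the bilinear interpolation between patch centres are all hypothetical choices. The sketch also assumes that busier patches receive a smaller multiplier (faster forgetting), in line with the abstract's aim of reducing blur in regions of dense activity.

```python
import numpy as np

def split_patches(hist, grid):
    """(a) Split an (H, W) histogram into a (gh, gw, ph, pw) array of patches."""
    gh, gw = grid
    h, w = hist.shape
    if h % gh or w % gw:
        raise ValueError("sketch assumes the image divides evenly into patches")
    return hist.reshape(gh, h // gh, gw, w // gw).swapaxes(1, 2)

def event_rate(patches):
    """(b) Event-rate style measure: mean absolute event count per patch."""
    return np.abs(patches).mean(axis=(2, 3))

def high_freq_energy(patches, cutoff=0.25):
    """(b) Spectral measure: power above a (hypothetical) cutoff frequency."""
    ph, pw = patches.shape[2:]
    fy = np.fft.fftfreq(ph)[:, None]   # cycles per pixel, vertical
    fx = np.fft.fftfreq(pw)[None, :]   # cycles per pixel, horizontal
    mask = np.hypot(fy, fx) > cutoff
    power = np.abs(np.fft.fft2(patches, axes=(2, 3))) ** 2
    return (power * mask).sum(axis=(2, 3))

def patch_decay(measure, d_min=0.5, d_max=0.95):
    """(c) Assumed linear mapping: the most active patch gets d_min, idle ones d_max."""
    peak = measure.max()
    activity = measure / peak if peak > 0 else np.zeros_like(measure)
    return d_max - (d_max - d_min) * activity

def interpolate_decay(patch_d, shape):
    """(c) Bilinear interpolation from patch centres to a per-pixel map d_k."""
    gh, gw = patch_d.shape
    h, w = shape
    cy = (np.arange(gh) + 0.5) * h / gh - 0.5   # patch-centre rows
    cx = (np.arange(gw) + 0.5) * w / gw - 0.5   # patch-centre columns
    ys, xs = np.arange(h), np.arange(w)
    # Interpolate along x within each patch row, then along y per pixel column.
    rows = np.stack([np.interp(xs, cx, patch_d[i]) for i in range(gh)])
    return np.stack([np.interp(ys, cy, rows[:, j]) for j in range(w)], axis=1)

def lads_update(prev_surface, hist, grid=(4, 4), measure=event_rate):
    """(d) One recursive step: S_k = d_k * S_{k-1} + H_k (element-wise)."""
    patches = split_patches(hist, grid)
    d = interpolate_decay(patch_decay(measure(patches)), hist.shape)
    return d * prev_surface + hist, d

# Usage: events confined to the top-left patch of an 8x8 histogram.
hist = np.zeros((8, 8))
hist[1, 1], hist[2, 3] = 3.0, -1.0
prev = np.ones((8, 8))
surface, d = lads_update(prev, hist, grid=(2, 2))
surface_fft, d_fft = lads_update(prev, hist, grid=(2, 2), measure=high_freq_energy)
```

Passing the measure as a function keeps the recursive step identical across variants, so a Laplacian-of-Gaussian measure could be substituted in the same way. Interpolating between patch centres avoids hard seams at patch boundaries in the per-pixel map.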