Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling

Viet Dung Nguyen, Mobina Ghorbaninejad, Chengyi Ma, Reynold Bailey, Gabriel J. Diaz, Alexander Fix, Ryan J. Suess, Alexander Ororbia

Abstract

Eye feature extraction from event-based data streams can be performed efficiently and with low energy consumption, offering great utility to real-world eye tracking pipelines. However, few eye feature extractors are designed to handle the sudden changes in event density that occur when the eye transitions between gaze behaviors with different kinematics, and this leads to degraded prediction performance. In this work, we address this problem by introducing the \emph{adaptive inference state space model} (AISSM), a novel architecture for feature extraction that is capable of dynamically adjusting the relative weight placed on current versus recent information. This relative weighting is determined via estimates of the signal-to-noise ratio and event density produced by a complementary \emph{dynamic confidence network}. Lastly, we craft and evaluate a novel learning technique that improves training efficiency. Experimental results demonstrate that the AISSM system outperforms state-of-the-art models for event-based eye feature extraction.

Paper Structure (11 sections, 4 equations, 3 figures, 1 table)


Figures (3)

  • Figure 1: Overall architecture of our adaptive inference state space model: (a) The core architecture. The encoder forms a posterior distribution of the current event information $q(s_{t-1})$ while the transition model utilizes a recurrent neural network (RNN) to produce a prior distribution over the next frame $p(s_t|s_{t-1},h_{t-1})$. The RNN output ($h_{t-1}$) is also used to update the prior at the next timestep. (b) The head is responsible for the estimation of the eye feature $\hat{y}$. Its input is an $\alpha$-weighted linear combination of the prior and posterior distributions over the latent state, i.e., it is a weighted summation of current and prior event information. (c) The dynamic confidence network predicts the weighting term $\alpha$ that is used in (b), which is an estimate of the current information's 'reliability'. (d) The inner workings of the AISSM's encoder module, which is a combination of CNNs and a multilayer perceptron (MLP), whose output is reshaped into a $2$D (categorical distribution) matrix that represents the present data state. (e) The inner workings of the AISSM's transition module, which takes as input the concatenation of the prior recurrent state ($h_{t-1}$) and the representation state ($s_{t-1}$) sampled from the previous posterior $q(s_{t-1})$. This input is passed through an MLP, forming the prior (categorical) distribution over past data $p(s_t)$.
  • Figure 2: The relationship between event density (blue) and successful predictions (hits) within a range of $10$ pixels (orange). The $x$-axis represents the time-steps within an event frame. The top panel depicts the CNN-GRU; the bottom panel depicts the AISSM.
  • Figure 3: Our AISSM's validation accuracy when training with (orange) and without (blue) our proposed long-horizon training technique.
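
The $\alpha$-weighted combination described in Figure 1(b) can be illustrated with a small sketch. This is not the authors' implementation: the function name `blend_latent`, the example probabilities, and the convention that $\alpha$ multiplies the posterior (current information) while $1-\alpha$ multiplies the prior are all assumptions made for illustration. It also uses a single categorical distribution rather than the $2$D matrix mentioned in Figure 1(d).

```python
def blend_latent(prior, posterior, alpha):
    """Return an alpha-weighted mix of two categorical distributions.

    Assumed convention (not confirmed by the source): alpha weights the
    posterior (current event information) and 1 - alpha weights the prior
    (recent information carried by the transition model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(prior) != len(posterior):
        raise ValueError("prior and posterior must have the same length")
    return [alpha * q + (1.0 - alpha) * p for p, q in zip(prior, posterior)]


# Hypothetical distributions over three latent categories.
prior = [0.7, 0.2, 0.1]      # stands in for p(s_t | s_{t-1}, h_{t-1})
posterior = [0.1, 0.3, 0.6]  # stands in for the encoder's q(.)

# High alpha: the current input is judged reliable, so the mix
# stays close to the posterior.
mixed_high = blend_latent(prior, posterior, alpha=0.9)

# Low alpha: the current input is judged unreliable (for example,
# sparse events), so the mix stays close to the prior.
mixed_low = blend_latent(prior, posterior, alpha=0.1)
```

Because both inputs sum to one and the weights are convex, the blended vector is again a valid categorical distribution, which is one reason a convex combination is a natural reading of the "weighted summation" in the caption.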