JaneEye: A 12-nm 2K-FPS 18.9-μJ/Frame Event-based Eye Tracking Accelerator

Tao Han, Ang Li, Qinyu Chen, Chang Gao

TL;DR

JaneEye targets energy-efficient, high-speed eye tracking for XR wearables: it converts asynchronous event streams into dense frames and processes them with an ultra-lightweight ConvJANET-based network on a 12-nm ASIC. The method uses time-based or count-based event aggregation, a compact neural architecture with ConvJANET, a gated MLP (GMLP), and a pupil localization head, and hardware-aware optimizations including activation approximations and mixed-precision quantization, coupled with progressive retraining. The resulting system achieves a pupil localization error of 2.45 pixels on the 3ET+ dataset with 17.6K parameters, supports event frame rates of up to 1250 Hz, and delivers an end-to-end latency of 0.5 ms (2000 FPS) at 18.9 μJ/frame and 567 GOP/s/W on a 12-nm ASIC with a 64-PE array. The reported results place JaneEye ahead of state-of-the-art eye trackers in energy-delay product while maintaining competitive accuracy, which makes real-time eye tracking viable for wearable XR devices. The work illustrates software-hardware co-design for sparse, event-based perception in resource-constrained environments.
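
The two aggregation modes mentioned above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the event tuple layout `(t_us, x, y, polarity)`, the function names, and the signed-count frame representation are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (timestamp in microseconds, x, y, polarity 0/1).
Event = Tuple[int, int, int, int]


def slice_by_time(events: List[Event], dt_us: int) -> List[List[Event]]:
    """Group time-sorted events into consecutive windows of dt_us microseconds.

    Windows that receive no events are kept as empty lists, so the number of
    slices depends only on elapsed time, not on scene activity.
    """
    if not events:
        return []
    t0 = events[0][0]
    slices: List[List[Event]] = []
    for ev in events:
        idx = (ev[0] - t0) // dt_us
        while len(slices) <= idx:
            slices.append([])
        slices[idx].append(ev)
    return slices


def slice_by_count(events: List[Event], n: int) -> List[List[Event]]:
    """Group events into consecutive chunks of exactly n events.

    A trailing chunk with fewer than n events is dropped (an assumption).
    """
    return [events[i:i + n] for i in range(0, len(events) - n + 1, n)]


def to_frame(events: List[Event], width: int, height: int) -> List[List[int]]:
    """Accumulate one slice into a dense 2D frame of signed event counts.

    Assumption: positive polarity adds 1 and negative polarity subtracts 1.
    """
    frame = [[0] * width for _ in range(height)]
    for _, x, y, polarity in events:
        frame[y][x] += 1 if polarity else -1
    return frame
```

With time slicing, the frame rate is fixed by the window length; a window of 800 μs, for example, corresponds to 1250 frames per second. With count slicing, the frame rate follows scene activity instead.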

Abstract

Eye tracking has become a key technology for gaze-based interactions in Extended Reality (XR). However, conventional frame-based eye-tracking systems often fall short of XR's stringent requirements for high accuracy, low latency, and energy efficiency. Event cameras present a compelling alternative, offering ultra-high temporal resolution and low power consumption. In this paper, we present JaneEye, an energy-efficient event-based eye-tracking hardware accelerator designed specifically for wearable devices, leveraging sparse, high-temporal-resolution event data. We introduce an ultra-lightweight neural network architecture featuring a novel ConvJANET layer, which simplifies the traditional ConvLSTM by retaining only the forget gate, thereby halving computational complexity without sacrificing temporal modeling capability. Our proposed model achieves high accuracy with a pixel error of 2.45 on the 3ET+ dataset, using only 17.6K parameters, with up to 1250 Hz event frame rate. To further enhance hardware efficiency, we employ custom linear approximations of activation functions (hardsigmoid and hardtanh) and fixed-point quantization. Through software-hardware co-design, our 12-nm ASIC implementation operates at 400 MHz, delivering an end-to-end latency of 0.5 ms (equivalent to 2000 Frames Per Second (FPS)) at an energy efficiency of 18.9 μJ/frame. JaneEye sets a new benchmark in low-power, high-performance eye-tracking solutions suitable for integration into next-generation XR wearables.
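
To make the forget-gate-only idea concrete, here is a minimal single-channel sketch of one recurrent step in plain Python. It follows the general JANET formulation (a forget gate and a candidate state, with the hidden state equal to the cell state); the exact ConvJANET equations, the channel counts, and the coefficients of the paper's custom activation approximations are not given in this summary, so the `slope=0.25` hardsigmoid, the 3×3 kernels, and every name below are assumptions.

```python
def hardsigmoid(x, slope=0.25):
    # Piecewise-linear stand-in for sigmoid. The slope is an assumed value,
    # not the paper's coefficient.
    return min(1.0, max(0.0, slope * x + 0.5))


def hardtanh(x):
    # Piecewise-linear stand-in for tanh: clip to [-1, 1].
    return min(1.0, max(-1.0, x))


def conv3x3_same(img, kernel):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def convjanet_step(x, h_prev, p):
    """One forget-gate-only recurrent step (hypothetical parameter dict p).

    f      = hardsigmoid(W_f * x + U_f * h_prev + b_f)
    cand   = hardtanh(W_c * x + U_c * h_prev + b_c)
    h_next = f * h_prev + (1 - f) * cand
    """
    fx, fh = conv3x3_same(x, p["w_f"]), conv3x3_same(h_prev, p["u_f"])
    cx, ch = conv3x3_same(x, p["w_c"]), conv3x3_same(h_prev, p["u_c"])
    h_next = []
    for r in range(len(x)):
        row = []
        for c in range(len(x[0])):
            f = hardsigmoid(fx[r][c] + fh[r][c] + p["b_f"])
            cand = hardtanh(cx[r][c] + ch[r][c] + p["b_c"])
            row.append(f * h_prev[r][c] + (1.0 - f) * cand)
        h_next.append(row)
    return h_next
```

A standard ConvLSTM computes four gate pre-activations per step (input, forget, output, and candidate), whereas this cell computes two, which is consistent with the halving of computational complexity stated in the abstract.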

Paper Structure

This paper contains 30 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: End-to-end flowchart of the proposed JaneEye eye-tracking system. The pipeline consists of three main stages. (1) Data Collection: An event camera captures sparse spatiotemporal data, generating a point cloud of events. (2) Preprocessing: This asynchronous point cloud is converted into dense 2D 'event-based frames' using two alternative aggregation methods: 'Slice by time' ($\Delta T$) or 'Slice by event count' ($N$ events). (3) JaneEye-Net Neural Network & JaneEye Hardware Acceleration: The resulting frame is fed into the lightweight JaneEye-Net, which uses three convolutional layers (Conv1-3) for spatial feature extraction, a Gated MLP, and our novel ConvJANET layer for spatiotemporal modeling. Finally, a Global Pooling and Fully Connected (FC) layer regress the (x, y) pupil coordinates. The JaneEye ASIC accelerates the JaneEye-Net eye-tracking neural network.
  • Figure 2: Microarchitecture of the proposed JaneEye hardware accelerator. The design is managed by a Top Controller and features dedicated on-chip SRAMs for Activations, Weights, and Biases to support high-bandwidth parallel memory access. A Data Dispatcher broadcasts data to the main computational core, which is organized as an array of 8 parallel Output Tiles. Each tile contains 8 PEs, and their partial sum outputs are aggregated by a dedicated Adder Tree. This 64-PE (8$\times$8) array performs the bulk of the MAC operations. The final results are passed to an Activation Core for nonlinear function processing.
  • Figure 3: Detailed architecture of a single PE for MAC operations. It features a 9 $\times$ 8-bit Weight Register, which allows for storing and reusing an entire 3$\times$3 convolution kernel locally, minimizing data movement. In a processing cycle, an 8-bit weight from the register is multiplied with a 16-bit activation. The 24-bit result is fed to a 32-bit adder. A multiplexer (MUX) selects whether to add this product to the previously accumulated value from the 32-bit Psum Register (accumulation step) or to '0' (to start a new computation). The final 32-bit partial sum (Psum) is passed to a Rounding & Truncate unit to produce the 16-bit output.
  • Figure 4: Post-layout specification
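
The PE datapath described in the Figure 3 caption can be modeled bit-accurately in a few lines of Python. This is a behavioral sketch, not the authors' RTL: the caption gives the bit widths (8-bit weights, 16-bit activations, 32-bit partial sum, 16-bit output) but not the number of fractional bits, the rounding mode, or the overflow behavior, so `frac_bits=7`, round-half-up, and output saturation are assumptions, as are all function names.

```python
def wrap32(value):
    """Wrap an integer to signed 32-bit two's complement, like a 32-bit register."""
    return ((value + (1 << 31)) % (1 << 32)) - (1 << 31)


def round_and_truncate(psum, frac_bits=7):
    """Reduce a 32-bit partial sum to a 16-bit output.

    Assumptions: round half up by adding half an LSB before an arithmetic
    right shift, then saturate to the signed 16-bit range.
    """
    if frac_bits > 0:
        psum = (psum + (1 << (frac_bits - 1))) >> frac_bits
    return max(-32768, min(32767, psum))


def pe_mac(weights, activations, frac_bits=7):
    """Model one PE accumulating up to nine products (one 3x3 kernel)."""
    assert len(weights) == len(activations) <= 9
    psum = 0
    for i, (w, a) in enumerate(zip(weights, activations)):
        assert -128 <= w <= 127          # signed 8-bit weight
        assert -32768 <= a <= 32767      # signed 16-bit activation
        product = w * a                  # magnitude <= 2**22, fits in 24 bits
        base = 0 if i == 0 else psum     # MUX: start new sum or accumulate
        psum = wrap32(base + product)    # 32-bit adder and Psum register
    return round_and_truncate(psum, frac_bits)
```

For example, nine weights of 1 and nine activations of 128 accumulate to 1152, which the assumed 7-bit shift reduces to 9. Nine products cannot overflow the 32-bit accumulator, since 9 × 2^22 is well below 2^31.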