BlissCam: Boosting Eye Tracking Efficiency with Learned In-Sensor Sparse Sampling

Yu Feng, Tianrui Ma, Yuhao Zhu, Xuan Zhang

TL;DR

This work tackles the high latency and power cost of near-eye eye-tracking by proposing BlissCam, a co-designed system that performs learned sparse sampling entirely within the image sensor. By downsampling to about 5% of pixels and first identifying ROI via inter-frame eventification before random ROI sampling, the approach reduces sensor readout, MIPI transfer, and off-sensor processing, while maintaining gaze accuracy. An in-sensor NPU executes a lightweight ROI predictor, and a Vision Transformer-based segmentation on sparse inputs preserves accuracy when faced with reduced data; the two components are trained end-to-end with differentiable objectives. Results show up to 8.2x energy savings and 1.4x latency reduction, with about 95% pixel reduction and minimal accuracy loss, demonstrating a practical, hardware-efficient pathway for end-to-end eye-tracking optimization in AR/VR devices.
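The sampling flow summarized above (eventify, locate an ROI, then randomly sample inside it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The paper predicts the ROI with a learned lightweight network on the in-sensor NPU, whereas this sketch substitutes the bounding box of the event map. The function names, the threshold value, and the full-frame fallback are all hypothetical.

```python
import random


def eventify(prev, curr, threshold):
    """Binary event map: 1 where the inter-frame difference exceeds threshold."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def bounding_roi(events):
    """Tightest inclusive box (top, left, bottom, right) around all events.

    Stand-in for the learned ROI predictor; returns None if nothing changed.
    """
    rows = [r for r, row in enumerate(events) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(events[0])) if any(row[c] for row in events)]
    return rows[0], cols[0], rows[-1], cols[-1]


def sample_roi(curr, roi, budget, rng):
    """Randomly pick at most `budget` pixels inside the ROI."""
    top, left, bottom, right = roi
    coords = [(r, c) for r in range(top, bottom + 1)
              for c in range(left, right + 1)]
    chosen = rng.sample(coords, min(budget, len(coords)))
    return {(r, c): curr[r][c] for r, c in chosen}


def sparse_sample(prev, curr, keep_fraction=0.05, threshold=10, seed=0):
    """Return (roi, samples), keeping about keep_fraction of the frame's pixels."""
    height, width = len(curr), len(curr[0])
    budget = max(1, round(keep_fraction * height * width))
    roi = bounding_roi(eventify(prev, curr, threshold))
    if roi is None:
        # Assumption: with no events, fall back to sampling the full frame.
        roi = (0, 0, height - 1, width - 1)
    return roi, sample_roi(curr, roi, budget, random.Random(seed))


if __name__ == "__main__":
    # Toy 20x20 frames: a bright patch appears at rows 5..10, columns 4..11.
    prev = [[50] * 20 for _ in range(20)]
    curr = [row[:] for row in prev]
    for r in range(5, 11):
        for c in range(4, 12):
            curr[r][c] = 200
    roi, samples = sparse_sample(prev, curr)
    print(roi, len(samples))
```

On the toy frames, the ROI covers only the changed patch, and the pixel budget (5% of 400 pixels, i.e. 20) is spent entirely inside it. Only those 20 values and their coordinates would need to leave the sensor.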

Abstract

Eye tracking is becoming an increasingly important task domain in emerging computing platforms such as Augmented/Virtual Reality (AR/VR). Today's eye tracking systems suffer from long end-to-end tracking latency and can easily eat up half of the power budget of a mobile VR device. Most existing optimization efforts exclusively focus on the computation pipeline by optimizing the algorithm and/or designing dedicated accelerators while largely ignoring the front-end of any eye tracking pipeline: the image sensor. This paper makes a case for co-designing the imaging system with the computing system. In particular, we propose the notion of "in-sensor sparse sampling", whereby the pixels are drastically downsampled (by 20x) within the sensor. Such in-sensor sampling enhances the overall tracking efficiency by significantly reducing 1) the power consumption of the sensor readout chain and sensor-host communication interfaces, two major power contributors, and 2) the work done on the host, which receives and operates on far fewer pixels. With careful reuse of existing pixel circuitry, our proposed BlissCam requires little hardware augmentation to support the in-sensor operations. Our synthesis results show up to 8.2x energy reduction and 1.4x latency reduction over existing eye tracking pipelines.

Paper Structure (56 sections, 1 equation, 17 figures, 1 table)

This paper contains 56 sections, 1 equation, 17 figures, 1 table.

Figures (17)

  • Figure 1: A typical eye tracking pipeline. It starts with image sensing (exposure and readout) to obtain a near-eye image, which is transferred to the host processor through the MIPI CSI-2 interface. The host processor first segments important eye parts, from which the gaze is estimated. Different frames are overlapped to improve tracking frequency. Figure not drawn to scale; readout delay is usually three orders of magnitude shorter than the exposure time.
  • Figure 2: The computational capabilities, quantified in GFLOPS, of today's mobile GPUs (using Nvidia Jetson series as examples) vs. the computational demands of state-of-the-art eye tracking algorithms (assuming a tracking rate of 120 FPS).
  • Figure 3: MIPI communication latency under different image resolutions. The red line shows the eye tracking latency requirement (15 ms).
  • Figure 4: Percentage of image sensor power attributed to the readout circuitry; data from six recent sensors [park2019640, park2020ultra, park2021cmos, singh202134, seo20222, ikeno2023evolution].
  • Figure 5: Our sparse sampling-based eye tracking pipeline. Each frame first gets sampled by our sparse sampling algorithm inside the sensor to dramatically reduce the sensor-host data volume (Sec. \ref{sec:algo:ss}); the sampled pixels then go through a sparse eye segmentation algorithm on the host, which is designed to be robust against sparse inputs (Sec. \ref{sec:algo:dnn}). The ROI prediction algorithm and the sparse segmentation algorithm are jointly trained to maximize end-to-end tracking accuracy (Sec. \ref{sec:algo:train}).
  • ...and 12 more figures