BlissCam: Boosting Eye Tracking Efficiency with Learned In-Sensor Sparse Sampling
Yu Feng, Tianrui Ma, Yuhao Zhu, Xuan Zhang
TL;DR
This work tackles the high latency and power cost of near-eye eye tracking by proposing BlissCam, a co-designed system that performs learned sparse sampling entirely within the image sensor. The sensor first identifies a region of interest (ROI) via inter-frame eventification and then randomly samples within that ROI, so that only about 5% of the pixels are read out. This reduces sensor readout, MIPI transfer, and off-sensor processing while maintaining gaze accuracy. An in-sensor NPU executes a lightweight ROI predictor, and a Vision Transformer-based segmentation model operating on the sparse inputs preserves accuracy despite the reduced data; the two components are trained end-to-end with differentiable objectives. Results show up to 8.2x energy savings and 1.4x latency reduction at about 95% pixel reduction with minimal accuracy loss, demonstrating a practical, hardware-efficient path to end-to-end eye-tracking optimization in AR/VR devices.
Abstract
Eye tracking is becoming an increasingly important task domain in emerging computing platforms such as Augmented/Virtual Reality (AR/VR). Today's eye tracking systems suffer from long end-to-end tracking latency and can easily consume half of the power budget of a mobile VR device. Most existing optimization efforts focus exclusively on the computation pipeline, optimizing the algorithm and/or designing dedicated accelerators, while largely ignoring the front end of any eye tracking pipeline: the image sensor. This paper makes a case for co-designing the imaging system with the computing system. In particular, we propose the notion of "in-sensor sparse sampling", whereby the pixels are drastically downsampled (by 20x) within the sensor. Such in-sensor sampling enhances the overall tracking efficiency by significantly reducing 1) the power consumption of the sensor readout chain and the sensor-host communication interface, two major power contributors, and 2) the work done on the host, which receives and operates on far fewer pixels. With careful reuse of existing pixel circuitry, our proposed BlissCam requires little hardware augmentation to support the in-sensor operations. Our synthesis results show up to 8.2x energy reduction and 1.4x latency reduction over existing eye tracking pipelines.
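
The sampling flow summarized above (inter-frame eventification, ROI selection, then random sampling inside the ROI with a 20x reduction) can be illustrated in software. The following is only a minimal NumPy sketch under stated assumptions, not the in-sensor circuit or the authors' implementation: the function names, the fixed difference threshold, and the bounding-box ROI (used here as a stand-in for the learned ROI predictor that runs on the in-sensor NPU) are all hypothetical, and the 5% budget is assumed to be measured against the full frame.

```python
import numpy as np

def eventify(prev, curr, threshold=15):
    """Binary event map: True where intensity changed by more than `threshold`.
    The threshold value is an illustrative assumption."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > threshold

def bounding_box_roi(events, margin=4):
    """Hypothetical stand-in for the learned ROI predictor: a padded
    bounding box around all events, returned as (y0, y1, x0, x1)."""
    h, w = events.shape
    ys, xs = np.nonzero(events)
    if ys.size == 0:
        return 0, h, 0, w  # no motion detected: fall back to the full frame
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin + 1, h)
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin + 1, w)
    return y0, y1, x0, x1

def sparse_sample(curr, roi, keep_fraction=0.05, rng=None):
    """Randomly keep pixels inside the ROI. The pixel budget is
    keep_fraction of the FULL frame (an assumption), capped at the ROI size."""
    rng = np.random.default_rng() if rng is None else rng
    y0, y1, x0, x1 = roi
    h, w = curr.shape
    roi_w = x1 - x0
    roi_size = (y1 - y0) * roi_w
    budget = min(int(round(keep_fraction * h * w)), roi_size)
    flat = rng.choice(roi_size, size=budget, replace=False)
    mask = np.zeros((h, w), dtype=bool)
    mask[y0 + flat // roi_w, x0 + flat % roi_w] = True
    return mask, curr[mask]
```

With keep_fraction=0.05 the mask retains one pixel in twenty, which corresponds to the 20x downsampling quoted in the abstract. Only the pixels selected by the mask would need to traverse the readout chain and the sensor-host interface.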
