GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

Neeraj Solanki, Hong Ding, Sepehr Tabrizchi, Ali Shafiee Sarvestani, Shaahin Angizi, David Z. Pan, Arman Roohi

Abstract

Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10\,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest (ROI) object detection. Our approach avoids arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate (MAC) computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40--50\% and energy consumption by 65\%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1\% mAP on COCO (51.8\% on attended objects) while maintaining sub-10\,ms latency, meeting stringent AR/VR requirements by improving communication time by $177\times$. Compared to the global YOLOv12n baseline, which achieves 39.2\%, 63.4\%, and 83.1\% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3\%, 72.1\%, and 88.1\% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
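To make the "memory lookups rather than multiply-accumulate computations" idea concrete, the following is a minimal sketch of a generic RAM-based (weightless) neuron. It is not the authors' DWN: the thermometer encoding, the class name `LutNeuron`, the helper names, and the example table are all illustrative assumptions, and the differentiable training procedure is not shown.

```python
# Hypothetical sketch of a lookup-table (weightless) neuron.
# Names and the thermometer encoding are assumptions for illustration only.

def thermometer_encode(value, thresholds):
    """Binarize a scalar: one bit per threshold, set when value exceeds it."""
    return [1 if value > t else 0 for t in thresholds]


def lut_address(bits):
    """Pack a sequence of bits (most significant first) into an integer address."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | b
    return addr


class LutNeuron:
    """A neuron whose output is read from a table; inference uses no multiplies."""

    def __init__(self, input_indices, table):
        if len(table) != 1 << len(input_indices):
            raise ValueError("table must have 2**n entries for n inputs")
        self.input_indices = tuple(input_indices)
        self.table = list(table)

    def __call__(self, bits):
        # Select this neuron's input bits, form the address, read the table.
        selected = [bits[i] for i in self.input_indices]
        return self.table[lut_address(selected)]


if __name__ == "__main__":
    bits = thermometer_encode(0.6, [0.25, 0.5, 0.75])  # [1, 1, 0]
    neuron = LutNeuron(input_indices=(0, 2), table=[0, 1, 1, 0])
    print(bits, neuron(bits))
```

In this sketch, inference amounts to bit selection, address formation, and one table read per neuron, so the per-neuron cost is fixed by the number of address bits rather than by a weighted sum.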

Paper Structure (22 sections, 7 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: (a) GLANCE overview: DWN-based gaze estimation drives union-of-ROIs attention; a periodic/conditional policy ships a single mosaic crop to the detector. (b) MCU/host dataflow. The host forms a per-frame spatial union to produce one union-of-ROIs mosaic and maintains a temporal state $U_t$ to trigger YOLO under a periodic/budget policy.
  • Figure 2: Proposed DWN architecture.
  • Figure 3: (a) ROI scale from gaze uncertainty: ROI side length as a function of hit probability $p$, with $S(p) = 2ep$ and $e = 46.7$ px as the projected uncertainty for an $8.3^{\circ}$ gaze error. (b) ROI selection from gaze uncertainty: visualization of the 2D gaze uncertainty field ($\mu = 8.3^{\circ}$, $e \approx 46.7$ px) with circular ROIs for $p \in \{0.5, 0.7, 0.9\}$ mapped to frame sizes of $48^2$, $64^2$, and $80^2$ px.
  • Figure 4: GLANCE end-to-end system delay across varying ROI sizes: (a) $48 \times 48$, (b) $64 \times 64$, and (c) $80 \times 80$, measured on the Arduino Nano 33 BLE (microcontroller) and NVIDIA RTX 3090 (host), broken down by the required processing steps.
  • Figure 5: (a) Memory footprint and (b) energy improvement of GLANCE vs. the conventional approach at the MCU end.
  • ...and 1 more figures
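
The ROI sizing in the Figure 3 caption can be checked with a short calculation. The sketch below evaluates $S(p) = 2ep$ with $e = 46.7$ px and then snaps the result to the nearest of the three frame sizes listed in the caption. The snapping rule and the function names are assumptions made for illustration; the caption states only that the ROIs are "mapped to" those sizes.

```python
# Sketch of the ROI side-length relation from the Figure 3 caption.
# The nearest-size snapping rule is an assumption, not stated in the source.

GAZE_ERROR_PX = 46.7        # e: projected uncertainty for an 8.3 degree gaze error
FRAME_SIZES = (48, 64, 80)  # frame side lengths in pixels, from the caption


def roi_side(p, e=GAZE_ERROR_PX):
    """ROI side length S(p) = 2 * e * p for hit probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 2.0 * e * p


def nearest_frame_size(side, sizes=FRAME_SIZES):
    """Pick the listed frame size closest to the computed side length."""
    return min(sizes, key=lambda s: abs(s - side))


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        side = roi_side(p)
        print(f"p={p}: S(p)={side:.1f} px -> {nearest_frame_size(side)} px frame")
```

Under this assumed rule, the three hit probabilities in the caption land on the 48, 64, and 80 px frames respectively, which is consistent with the sizes listed there.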