GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

Neeraj Solanki; Hong Ding; Sepehr Tabrizchi; Ali Shafiee Sarvestani; Shaahin Angizi; David Z. Pan; Arman Roohi

Abstract

Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10\,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest (ROI) object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40--50\% and energy consumption by 65\%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1\% mAP on COCO (51.8\% on attended objects) while maintaining sub-10\,ms latency, meeting stringent AR/VR requirements by improving the communication time by $177\times$. Compared to the global YOLOv12n baseline, which achieves 39.2\%, 63.4\%, and 83.1\% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3\%, 72.1\%, and 88.1\% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
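To make the "memory lookups rather than multiply-accumulate computations" idea concrete, the following is a minimal sketch of a single lookup-table (LUT) neuron of the kind used in weightless neural networks. It is not the authors' DWN: the names `lut_address` and `LutNeuron`, the input wiring, and the table contents are all hypothetical and chosen only for illustration, and training (including the differentiable variant) is not shown.

```python
from typing import Sequence


def lut_address(bits: Sequence[int]) -> int:
    """Pack a sequence of 0/1 values into an integer address (first bit is the MSB)."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | (b & 1)
    return addr


class LutNeuron:
    """Hypothetical weightless neuron: selected input bits address a table of stored outputs."""

    def __init__(self, input_indices: Sequence[int], table: Sequence[int]):
        if len(table) != 2 ** len(input_indices):
            raise ValueError("table must have 2**n entries for n inputs")
        self.input_indices = list(input_indices)
        self.table = list(table)

    def __call__(self, x_bits: Sequence[int]) -> int:
        # Inference is one table read; no multiplications or additions on weights.
        addr = lut_address([x_bits[i] for i in self.input_indices])
        return self.table[addr]


# Illustrative table: a 2-input neuron wired to bits 0 and 2 that stores XOR.
xor_neuron = LutNeuron(input_indices=[0, 2], table=[0, 1, 1, 0])
print(xor_neuron([1, 0, 0, 1]))  # bits (1, 0) form address 2
```

The design point this illustrates is that the cost per neuron is a memory access indexed by the input pattern, which is why such models are described in terms of table size rather than MAC count.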

Paper Structure (22 sections, 7 equations, 6 figures, 2 tables)

Introduction
Background
Weightless Neural Networks: Memory-Centric Computation
Classical and Differentiable Variants
Object Detection: From Proposals to Real Time
System Overview and Dataflow
Weightless Gaze Estimation
From Gaze Estimation to ROI Formation
Attention-Guided ROI-Based Object Detection
Protocol (spatial, per-frame)
Metric
Behavior by scale
Temporal ROI Accumulation (Policy Study)
Rotation- and Motion-Aware ROI Stabilization
Head-rotation–aware realignment
...and 7 more sections

Figures (6)

Figure 1: (a) GLANCE overview: DWN-based gaze estimation drives union-of-ROIs attention; a periodic/conditional policy ships a single mosaic crop to the detector. (b) MCU/host dataflow. The host forms a per-frame spatial union to produce one union-of-ROIs mosaic and maintains a temporal state $U_t$ to trigger YOLO under a periodic/budget policy.
Figure 2: Proposed DWN architecture.
Figure 3: (a) ROI scale from gaze uncertainty. ROI side length as a function of hit probability $p$, with $S(p) = 2ep$ and $e = 46.7$ px as the projected uncertainty for an $8.3^{\circ}$ gaze error. (b) ROI selection from gaze uncertainty. Visualization of the 2D gaze uncertainty field ($\mu = 8.3^{\circ}$, $e \approx 46.7$ px) with circular ROIs for $p = \{0.5, 0.7, 0.9\}$ mapped to frame sizes of $48^2$, $64^2$, and $80^2$ px.
Figure 4: GLANCE end-to-end system delay across varying ROI sizes: (a) 48$\times$48, (b) 64$\times$64, and (c) 80$\times$80, measured on the Arduino Nano 33 BLE (microcontroller) and NVIDIA RTX 3090 (host), with respect to the required processing steps.
Figure 5: (a) Memory footprint and (b) energy improvement of GLANCE vs. the conventional approach at the MCU end.
...and 1 more figures
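
The ROI sizing in the Figure 3 caption can be checked with a few lines of arithmetic. The sketch below evaluates $S(p) = 2ep$ with $e = 46.7$ px for the three hit probabilities listed. The caption does not say how the raw side lengths are mapped to the 48, 64, and 80 px frames; rounding to the nearest multiple of 16 px is an assumption made here (the helper `snap_to_grid` is hypothetical), chosen because it happens to reproduce those sizes.

```python
E_PX = 46.7  # projected gaze uncertainty in pixels, taken from the Figure 3 caption


def roi_side(p: float, e: float = E_PX) -> float:
    """ROI side length S(p) = 2 * e * p, in pixels."""
    if not 0.0 < p <= 1.0:
        raise ValueError("hit probability p must be in (0, 1]")
    return 2.0 * e * p


def snap_to_grid(side: float, step: int = 16) -> int:
    """Round a side length to the nearest multiple of `step` (assumed rule, not from the paper)."""
    return max(step, step * round(side / step))


for p in (0.5, 0.7, 0.9):
    s = roi_side(p)
    print(f"p={p:.1f}  S(p)={s:.2f} px  snapped={snap_to_grid(s)} px")
```

Under this assumed rounding rule, the raw side lengths of roughly 46.7, 65.4, and 84.1 px land on 48, 64, and 80 px, matching the frame sizes in the caption.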
