Ocularone-Bench: Benchmarking DNN Models on GPUs to Assist the Visually Impaired

Suman Raj, Bhavani A Madhabhavi, Kautuk Astu, Arnav A Rajesh, Pratham M, Yogesh Simmhan

TL;DR

Ocularone-Bench addresses navigation for visually impaired persons (VIPs) by benchmarking DNN inference on edge devices and high-end GPUs, using a hazard vest to identify the VIP. The authors curate a hazard-vest dataset, retrain YOLOv8 and YOLOv11 variants on it, and evaluate accuracy-latency trade-offs for vision-based VIP assistance, including pose and depth models. Key findings are up to 99.4% precision on diverse data and sub-25 ms inference on workstation hardware, with edge devices reaching real-time performance when running smaller models. The work demonstrates the practical feasibility of safe mobile VIP navigation and points toward multi-modal, adaptive deployments in edge-cloud ecosystems.

Abstract

VIP navigation requires multiple DNN models for identification, posture analysis, and depth estimation to ensure safe mobility. Using a hazard vest as a unique identifier enhances visibility, while selecting the right DNN model and computing device balances accuracy and real-time performance. We present Ocularone-Bench, a benchmark suite designed to address the lack of curated datasets for uniquely identifying individuals in crowded environments and the need for benchmarking DNN inference times on resource-constrained edge devices. The suite evaluates the accuracy-latency trade-offs of YOLO models retrained on this dataset and benchmarks inference times of situation awareness models across edge accelerators and high-end GPU workstations. Our study on NVIDIA Jetson devices and an RTX 4090 workstation demonstrates significant improvements in detection accuracy, achieving up to 99.4% precision, while also providing insights into real-time feasibility for mobile deployment. Beyond VIP navigation, Ocularone-Bench is applicable to safety monitoring of senior citizens, children, and workers, and to other vision-based applications.
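
The latency side of the accuracy-latency trade-off described above can be illustrated with a small timing harness. This is a minimal sketch and not the authors' benchmarking code: the function names `benchmark_latency` and `meets_frame_budget`, the warm-up and run counts, the 30 FPS target, and the CPU-only `fake_model` stand-in are all assumptions made for illustration. A real measurement would call an actual detector and, on a GPU, would need to wait for queued kernels to finish before reading the clock.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_latency(infer: Callable[[], object],
                      warmup: int = 5,
                      runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to `infer` and summarise latency in milliseconds.

    `infer` is a hypothetical zero-argument callable wrapping one inference.
    On a GPU it should also synchronise the device before returning, since
    kernel launches are asynchronous.
    """
    # Warm-up calls are discarded so one-off start-up costs do not skew results.
    for _ in range(warmup):
        infer()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Simple index-based estimate of the 95th percentile.
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


def meets_frame_budget(latency_ms: float, target_fps: float) -> bool:
    """Return True if one inference fits in the per-frame time budget."""
    return latency_ms <= 1000.0 / target_fps


def fake_model() -> int:
    """Stand-in workload (an assumption); replace with a real model call."""
    return sum(i * i for i in range(20_000))


if __name__ == "__main__":
    stats = benchmark_latency(fake_model)
    print(f"median: {stats['median_ms']:.2f} ms, p95: {stats['p95_ms']:.2f} ms")
    # 30 FPS is an assumed target, not a figure taken from the paper.
    print("fits 30 FPS budget:", meets_frame_budget(stats["p95_ms"], 30.0))
```

Reporting a high percentile alongside the median is a design choice: a navigation aid is affected more by occasional slow frames than by the typical one. Under the assumed 30 FPS target the per-frame budget is about 33.3 ms, so a sub-25 ms inference would fit within it.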

Paper Structure

This paper contains 12 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Accuracy of YOLOv11 (medium) trained using $1k$ random (top) and $3.8k$ curated (bottom) hazard-vest images
  • Figure 2: Sample images from the dataset
  • Figure 3: Accuracy (in %) of VIP detection using different sizes of Re-trained (RT) YOLOv8 (top) and YOLOv11 (bottom) on diverse datasets
  • Figure 4: Accuracy (in %) of VIP detection using different sizes of Re-trained (RT) YOLOv8 (top) and YOLOv11 (bottom) on adversarial datasets
  • Figure 5: Inference Times on Jetson Edge Accelerators
  • ...and 1 more figure