A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors
Vinh Chau, Khoa Le Dinh Van, Hon Huynh Ngoc, Binh Nguyen Thien, Hao Nguyen Thien, Vy Nguyen Quang, Phuc Vo Hong, Yen Lam Minh, Kieu Pham Tieu, Trinh Nguyen Thi Diem, Louise Thwaites, Hai Ho Bich
TL;DR
The paper tackles the interoperability gap in low-resource ICUs by digitizing vital signs from legacy bedside monitors using a lightweight, edge-friendly hierarchical computer vision pipeline. It localizes the monitor with YOLOv11, detects vital-sign regions of interest (ROIs), rectifies perspective, and performs OCR with PaddleOCR, achieving end-to-end accuracy above 98.9% for core parameters and real-time performance on modest hardware. Key contributions include a dual-YOLOv11 architecture for monitor localization and ROI detection, a geometry-based rectification module, and a robust PP-OCRv5 extraction stage, validated on open datasets and 2,098 real-world ICU images. The approach demonstrates strong potential for practical deployment in low- and middle-income countries (LMICs), enabling rapid digitization of data from otherwise closed, non-networked monitors without hardware replacement. Future work targets waveform digitization, edge-device optimization, and transformer-based detectors to further improve robustness and autonomy.
Abstract
In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardises the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation (SpO2), and arterial blood pressure. These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.
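To make the geometric rectification step concrete, the following is a minimal sketch of one common way to standardise a screen's perspective: estimating a planar homography from four screen corners and mapping points into an upright rectangle. It is an illustration under stated assumptions, not the authors' implementation. The function names (`solve_linear`, `homography`, `warp_point`), the corner coordinates, and the 400 x 300 target size are hypothetical, and the sketch assumes the four corners are already available from an upstream detection stage.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping 4 src points to dst."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def warp_point(h: List[List[float]], p: Point) -> Point:
    """Apply H to a point, including the projective division."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (
        (h[0][0] * x + h[0][1] * y + h[0][2]) / w,
        (h[1][0] * x + h[1][1] * y + h[1][2]) / w,
    )


# Hypothetical corners of a skewed monitor screen in the camera frame
# (top-left, top-right, bottom-right, bottom-left), in pixels.
screen_corners = [(120.0, 80.0), (520.0, 110.0), (500.0, 390.0), (100.0, 340.0)]
# Assumed upright target rectangle for the rectified screen.
target_corners = [(0.0, 0.0), (400.0, 0.0), (400.0, 300.0), (0.0, 300.0)]

H = homography(screen_corners, target_corners)
rectified = [warp_point(H, c) for c in screen_corners]
```

In practice a library routine would normally replace the hand-written solver, for example OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective` to resample the whole image before it is passed to OCR.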
