PTB-XL-Image-17K: A Large-Scale Synthetic ECG Image Dataset with Comprehensive Ground Truth for Deep Learning-Based Digitization
Naqcho Ali Mehdi
TL;DR
This work addresses the lack of large-scale ECG image datasets with ground truth suitable for end-to-end digitization by introducing PTB-XL-Image-17K, a synthetic dataset of 17,271 high-quality 12-lead ECG images with complete ground truth: images, segmentation masks, time-series signals, YOLO-format bounding boxes, and rich metadata. It provides an open-source generation framework with controllable parameters for simulating diverse recording conditions, annotates both lead regions and lead-name labels, and validates high-fidelity signal reconstruction and accurate localization. The dataset supports end-to-end digitization tasks and research on overlapping waveforms, offering robust baselines for lead detection, waveform segmentation, and pixel-to-signal calibration, with strong reported metrics (IoU >0.90, correlation >0.998). By making both the data and the framework publicly available, it aims to accelerate the development of automated ECG digitization pipelines applicable to legacy archives, telemedicine, and multi-modal learning, and it outlines future extensions to more layouts and validation on real scanned ECGs.
Abstract
Electrocardiogram (ECG) digitization, the conversion of paper-based or scanned ECG images back into time-series signals, is critical for leveraging decades of legacy clinical data in modern deep learning applications. However, progress has been hindered by the lack of large-scale datasets providing both ECG images and their corresponding ground truth signals with comprehensive annotations. We introduce PTB-XL-Image-17K, a complete synthetic ECG image dataset comprising 17,271 high-quality 12-lead ECG images generated from the PTB-XL signal database. Our dataset uniquely provides five complementary data types per sample: (1) realistic ECG images with authentic grid patterns and annotations (50% with visible grid, 50% without), (2) pixel-level segmentation masks, (3) ground truth time-series signals, (4) bounding box annotations in YOLO format for both lead regions and lead name labels, and (5) comprehensive metadata including visual parameters and patient information. We present an open-source Python framework enabling customizable dataset generation with controllable parameters including paper speed (25/50 mm/s), voltage scale (5/10 mm/mV), sampling rate (500 Hz), grid appearance (4 colors), and waveform characteristics. Dataset generation achieves a 100% success rate with an average processing time of 1.35 seconds per sample. PTB-XL-Image-17K addresses critical gaps in ECG digitization research by providing the first large-scale resource supporting the complete pipeline: lead detection, waveform segmentation, and signal extraction with full ground truth for rigorous evaluation. The dataset, generation framework, and documentation are publicly available at https://github.com/naqchoalimehdi/PTB-XL-Image-17K and https://doi.org/10.5281/zenodo.18197519.
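To make the role of paper speed, voltage scale, and sampling rate concrete, the following is a minimal sketch of how a signal sample could map to image coordinates and back, together with a conversion of a normalized YOLO box to pixel corners. It is not taken from the released framework: the names `Calibration`, `signal_to_pixels`, `pixels_to_signal`, and `yolo_to_pixel_box` are hypothetical, and the rendering resolution (`dpi`), waveform origin, and baseline row are assumptions that would in practice come from each sample's metadata.

```python
import math
from dataclasses import dataclass

MM_PER_INCH = 25.4


@dataclass(frozen=True)
class Calibration:
    """Hypothetical per-image calibration; field names are illustrative only."""
    dpi: float                       # assumed rendering resolution
    paper_speed_mm_s: float = 25.0   # 25 or 50 mm/s in the abstract
    gain_mm_mv: float = 10.0         # 5 or 10 mm/mV in the abstract
    sampling_rate_hz: float = 500.0  # 500 Hz in the abstract
    origin_x_px: float = 0.0         # assumed x pixel of the first sample
    baseline_y_px: float = 0.0       # assumed y pixel of the 0 mV line

    @property
    def px_per_mm(self) -> float:
        return self.dpi / MM_PER_INCH


def signal_to_pixels(sample_index: float, amplitude_mv: float, cal: Calibration):
    """Map (sample index, millivolts) to (x, y) pixel coordinates."""
    t_seconds = sample_index / cal.sampling_rate_hz
    x = cal.origin_x_px + t_seconds * cal.paper_speed_mm_s * cal.px_per_mm
    # Image rows grow downward, so positive voltages move up (smaller y).
    y = cal.baseline_y_px - amplitude_mv * cal.gain_mm_mv * cal.px_per_mm
    return x, y


def pixels_to_signal(x: float, y: float, cal: Calibration):
    """Inverse mapping: (x, y) pixels back to (sample index, millivolts)."""
    t_seconds = (x - cal.origin_x_px) / (cal.paper_speed_mm_s * cal.px_per_mm)
    sample_index = t_seconds * cal.sampling_rate_hz
    amplitude_mv = (cal.baseline_y_px - y) / (cal.gain_mm_mv * cal.px_per_mm)
    return sample_index, amplitude_mv


def yolo_to_pixel_box(xc: float, yc: float, w: float, h: float,
                      img_w: int, img_h: int):
    """Convert a normalized YOLO box (center x, center y, width, height)
    to pixel corners (x_min, y_min, x_max, y_max)."""
    bw, bh = w * img_w, h * img_h
    cx, cy = xc * img_w, yc * img_h
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2


if __name__ == "__main__":
    cal = Calibration(dpi=254.0, baseline_y_px=300.0)  # 254 dpi gives 10 px/mm
    print(signal_to_pixels(500, 1.0, cal))   # one second in, 1 mV above baseline
    print(yolo_to_pixel_box(0.5, 0.5, 0.2, 0.1, 1000, 800))
```

Under these assumptions, at 25 mm/s and 10 mm/mV one second of signal spans 25 mm horizontally and 1 mV spans 10 mm vertically, so the pixel scale depends only on the resolution at which the image was rendered. A digitization pipeline would apply the inverse mapping to the segmented waveform pixels inside each detected lead box.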
