Datasets of Visualization for Machine Learning
Can Liu, Ruike Jiang, Shaocong Tan, Jiacheng Yu, Chaofan Yang, Hanning Shao, Xiaoru Yuan
TL;DR
Visualization datasets enable ML-driven automation in visualization pipelines. The authors propose a three-dimensional what-why-how framework to structure and analyze these datasets, and they catalog prominent datasets (e.g., VizNet, FigureQA, ChartSense) by content, tasks, and construction. They identify core challenges, notably heterogeneity of data formats and limited scale, and propose directions around standardization, openness, and intelligent annotation to facilitate wider, reproducible use. Overall, the work provides a foundation for consistent dataset design and cross-study benchmarking in AI-assisted visualization.
Abstract
Visualization datasets play a crucial role in automating data-driven visualization pipelines, serving as the foundation for supervised model training and algorithm benchmarking. In this paper, we survey the literature on visualization datasets and provide a comprehensive overview of existing ones, including their data types, formats, supported tasks, and openness. We propose a what-why-how model for visualization datasets that considers the content of the dataset (what), the supported tasks (why), and the dataset construction process (how). This model clarifies the diversity and complexity of visualization datasets. Additionally, we highlight the challenges faced by existing visualization datasets, including the lack of standardization in data types and formats and the limited availability of large-scale datasets. To address these challenges, we suggest future research directions.
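As a rough illustration of how the what-why-how model could be used to describe a dataset, the sketch below stores one record per dataset and filters records by supported task. The class name `VisDatasetRecord`, its field names, the helper `supporting`, and the two placeholder entries are all assumptions made for this example; they are not the paper's schema and do not describe any real dataset.

```python
from dataclasses import dataclass, field


@dataclass
class VisDatasetRecord:
    """Hypothetical record describing one visualization dataset."""
    name: str
    what: list[str] = field(default_factory=list)  # content: data types and formats
    why: list[str] = field(default_factory=list)   # supported tasks
    how: list[str] = field(default_factory=list)   # construction process
    is_open: bool = False                          # whether the dataset is publicly available


def supporting(records, task):
    """Return the names of records whose 'why' dimension lists the given task."""
    return [r.name for r in records if task in r.why]


# Placeholder entries for illustration only; not real datasets.
records = [
    VisDatasetRecord(
        name="example-a",
        what=["chart image (PNG)"],
        why=["chart classification"],
        how=["web crawling"],
        is_open=True,
    ),
    VisDatasetRecord(
        name="example-b",
        what=["table (CSV)", "chart specification (JSON)"],
        why=["visualization recommendation"],
        how=["synthetic generation"],
    ),
]

print(supporting(records, "chart classification"))
```

Keeping the three dimensions as separate fields makes it straightforward to compare datasets along one axis at a time, for instance listing every dataset that supports a given task regardless of how it was built.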
