Datasets of Visualization for Machine Learning

Can Liu; Ruike Jiang; Shaocong Tan; Jiacheng Yu; Chaofan Yang; Hanning Shao; Xiaoru Yuan

TL;DR

Visualization datasets enable ML-driven automation in visualization pipelines. The authors propose a three-dimensional what-why-how framework to structure and analyze these datasets, and they catalog prominent datasets (e.g., VizNet, FigureQA, ChartSense) by content, tasks, and construction. They identify core challenges, notably heterogeneity of data formats and limited scale, and propose directions around standardization, openness, and intelligent annotation to facilitate wider, reproducible use. Overall, the work provides a foundation for consistent dataset design and cross-study benchmarking in AI-assisted visualization.

Abstract

Datasets of visualization play a crucial role in automating data-driven visualization pipelines, serving as the foundation for supervised model training and algorithm benchmarking. In this paper, we survey the literature on visualization datasets and provide a comprehensive overview of existing visualization datasets, including their data types, formats, supported tasks, and openness. We propose a what-why-how model for visualization datasets, considering the content of the dataset (what), the supported tasks (why), and the dataset construction process (how). This model provides a clear understanding of the diversity and complexity of visualization datasets. Additionally, we highlight the challenges faced by existing visualization datasets, including the lack of standardization in data types and formats and the limited availability of large-scale datasets. To address these challenges, we suggest future research directions.
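
As an illustration only, the what-why-how model can be read as a small record schema for cataloguing datasets. The sketch below is not taken from the paper: the category names under `What` and `Why` mirror the section headings listed under Paper Structure, while `DatasetEntry`, `select`, the free-text `how` field, the `open_access` flag, and both catalogue entries are hypothetical names and placeholder values introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional


class What(Enum):
    """Content of a dataset (names follow the paper's section headings)."""
    UNDERLYING_DATA = "underlying data"
    VISUALIZATION_COMPONENTS = "visualization components"
    VISUALIZATION_PRESENTATION = "visualization presentation"
    ADDITIONAL_INFORMATION = "additional information"


class Why(Enum):
    """Usage of a dataset (names follow the paper's section headings)."""
    BASIC_TECHNIQUES = "basic techniques"
    GENERAL_TASKS = "general tasks"
    USER_TASKS = "user tasks"


@dataclass(frozen=True)
class DatasetEntry:
    """One catalogue row. Field names are assumptions, not the paper's coding scheme."""
    name: str
    what: FrozenSet[What]
    why: FrozenSet[Why]
    how: str                   # construction method, kept as free text here
    open_access: bool = False  # whether the dataset is publicly available


def select(entries: List[DatasetEntry],
           what: Optional[What] = None,
           why: Optional[Why] = None) -> List[DatasetEntry]:
    """Return entries that match the requested content and usage categories."""
    return [
        e for e in entries
        if (what is None or what in e.what) and (why is None or why in e.why)
    ]


# Placeholder entries; they do not describe any real dataset.
catalog = [
    DatasetEntry(
        name="hypothetical-table-corpus",
        what=frozenset({What.UNDERLYING_DATA}),
        why=frozenset({Why.GENERAL_TASKS}),
        how="crawled",
        open_access=True,
    ),
    DatasetEntry(
        name="hypothetical-chart-images",
        what=frozenset({What.VISUALIZATION_PRESENTATION,
                        What.ADDITIONAL_INFORMATION}),
        why=frozenset({Why.BASIC_TECHNIQUES, Why.USER_TASKS}),
        how="synthesized",
    ),
]

if __name__ == "__main__":
    for entry in select(catalog, what=What.VISUALIZATION_PRESENTATION):
        print(entry.name, sorted(w.value for w in entry.why))
```

Keeping `what` and `why` as sets reflects that a single dataset may cover several content types and support several tasks; a real catalogue would likely need a finer-grained vocabulary than these top-level categories.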

Paper Structure (22 sections, 14 figures, 1 table)

Introduction
Related Surveys
Methodology
Definition and Scope
Coding
What: Content of Dataset
Underlying Data
Visualization Components
Visualization Presentation
Additional Information
Why: Usage of Dataset
Basic Techniques
General Tasks
User Tasks
Findings
...and 7 more sections

Figures (14)

Figure 1: The what-why-how model for the dataset of visualizations. What describes the content of the visualization, why describes the usage of the datasets, and how describes the construction methods of these datasets.
Figure 2: Relationship of different datasets in "What". The underlying data are objects for visual mapping and rendering to build visualization components. These components are combined and laid out to compose a visualization presentation, which can be enhanced with natural language (NL), such as descriptions and annotations. NL can also serve as an interface to create, query, comment on, and give feedback on a visualization. In addition, different forms of visualization and their components can be used for tasks such as user perception experiments and machine learning training.
Figure 3: Example of underlying data and its visualization presentation (VizNet).
Figure 4: The 10-category chart image corpus of ReVision.
Figure 5: The output of the fully automatic annotation system in Visually29K (Madan et al., 2021).
...and 9 more figures
