ScaleViz: Scaling Visualization Recommendation Models on Large Data
Ghazi Shazan Ahmad, Shubham Agarwal, Subrata Mitra, Ryan Rossi, Manav Doshi, Vibhor Porwal, Syam Manoj Kumar Paila
TL;DR
ScaleViz tackles the scalability bottleneck in automated visualization recommendation (vis-rec) by introducing a budget-aware reinforcement-learning pipeline. It combines a cost profiler, which extrapolates per-feature computation costs, with a double deep Q-learning agent that sequentially selects a subset of statistics under a user-defined time budget while aiming to preserve model performance. Evaluations on VizML and MLVR across four large datasets show speedups of up to $\approx 10\times$ with minimal degradation in accuracy, and analyses reveal strongly dataset-specific feature selection and favorable scaling behavior as data size grows. The approach makes vis-rec practical and cost-effective on large-scale datasets, aligning visualization quality with real-world time constraints.
Abstract
Automated visualization recommendation (vis-rec) helps users derive crucial insights from new datasets. Typically, automated vis-rec models first calculate a large number of statistics from a dataset and then use machine-learning models to score or classify candidate visualizations based on those statistics, recommending the most effective ones. However, state-of-the-art models rely on a very large number of expensive statistics, so applying them to large datasets becomes infeasible because of prohibitive computation time, which limits their usefulness on most complex, large real-world datasets. In this paper, we propose a novel reinforcement-learning (RL) based framework that takes a given vis-rec model and a user-specified time budget and identifies the set of input statistics that is most effective for generating visual insights with that model within the budget. Using two state-of-the-art vis-rec models applied to three large real-world datasets, we show that our technique significantly reduces time-to-visualize while introducing very little error. Our approach is about 10X faster than baseline approaches that introduce similar amounts of error.
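To make the budgeted selection concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a linear cost model fitted on small-sample timings as a stand-in for the cost profiler, and a caller-supplied `score` function as a hypothetical placeholder for the learned Q-values of the RL agent. The function names, the statistic names, and the numbers in the usage lines are illustrative assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def fit_linear_cost(sizes: Sequence[float], times: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of time = a * rows + b from small-sample timings.

    Assumption: cost grows linearly with row count; a real profiler may
    need a different functional form per statistic.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    var_x = sum((x - mean_x) ** 2 for x in sizes)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def extrapolate_cost(sizes: Sequence[float], times: Sequence[float], full_rows: float) -> float:
    """Predict the cost of one statistic on the full dataset."""
    a, b = fit_linear_cost(sizes, times)
    return max(0.0, a * full_rows + b)


def select_under_budget(
    costs: Dict[str, float],
    score: Callable[[str, List[str]], float],
    budget: float,
) -> List[str]:
    """Pick statistics one at a time until none fits in the remaining budget.

    `score(candidate, chosen_so_far)` is a hypothetical stand-in for the
    value an RL agent would assign to adding `candidate` next.
    """
    chosen: List[str] = []
    remaining = budget
    candidates = set(costs)
    while True:
        affordable = [f for f in candidates if costs[f] <= remaining]
        if not affordable:
            break
        # Ties on score are broken by name so the result is deterministic.
        best = max(affordable, key=lambda f: (score(f, chosen), f))
        chosen.append(best)
        remaining -= costs[best]
        candidates.remove(best)
    return chosen


# Illustrative usage with made-up timings and utilities.
predicted = extrapolate_cost([1000, 2000, 4000], [0.01, 0.02, 0.04], 1_000_000)
utilities = {"mean": 0.2, "entropy": 0.5, "pairwise_corr": 0.9}
costs = {"mean": 1.0, "entropy": 4.0, "pairwise_corr": 8.0}
picked = select_under_budget(costs, lambda f, chosen: utilities[f], budget=5.0)
```

In this toy run the most useful statistic, `pairwise_corr`, is never selected because its predicted cost alone exceeds the budget, which illustrates why selection has to weigh usefulness against cost rather than rank statistics by usefulness alone.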
