Exploring Scalability in Large-Scale Time Series in DeepVATS framework
Inmaculada Santamaria-Valenzuela, Victor Rodriguez-Fernandez, David Camacho
TL;DR
DeepVATS tackles scalable analysis of large time-series by integrating a DL-based embedding workflow with a VA-driven interactive visualization. It leverages a Python-based DL module, a Weights & Biases storage backend, and an R Shiny VA frontend to produce 2D projections of embeddings and linked time-series plots. The paper reports a scalability analysis using the Solar Power dataset, identifying GPU UMAP instability, Python–R communication overhead, and reactive Shiny bottlenecks as key performance challenges, and proposes caching and alternative DR pipelines as remedies. Results show good performance on small to moderate datasets but reveal stability and speed issues at million-element scales, guiding future enhancements. Overall, the work provides actionable insights for building scalable, deep-learning-assisted VA tools for time-series.
Abstract
Visual analytics is essential for studying large time series due to its ability to reveal trends, anomalies, and insights. DeepVATS is a tool that merges Deep Learning (Deep) with Visual Analytics (VA) for the analysis of large time series data (TS). It has three interconnected modules. The Deep Learning module, developed in Python, manages the loading of datasets and Deep Learning models from and to the Storage module. This module also supports model training and the extraction of embeddings from the latent space of the trained model. The Storage module operates using the Weights and Biases system. The embeddings can then be analyzed in the Visual Analytics module. This module, based on an R Shiny application, allows the adjustment of the parameters related to the projection and clustering of the embedding space. Once these parameters are set, interactive plots representing both the embeddings and the time series are shown. This paper introduces the tool and examines its scalability through log analytics. The evolution of execution time is examined as the length of the time series is varied. This is achieved by resampling a large time series into smaller subsets and logging the main execution and rendering times for later scalability analysis.
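The resample-and-log procedure described in the abstract can be illustrated with a minimal sketch. It is not the DeepVATS implementation: the function names (resample_to_length, placeholder_stage, run_scalability_log), the block-averaging resampling strategy, and the sorting step that stands in for an embedding or projection stage are all assumptions made for illustration.

```python
import random
import time


def resample_to_length(series, target_len):
    """Downsample by averaging consecutive blocks (assumed resampling strategy)."""
    if not 0 < target_len <= len(series):
        raise ValueError("target_len must be between 1 and len(series)")
    block = len(series) // target_len
    # Trailing points that do not fill a whole block are dropped.
    return [
        sum(series[i * block:(i + 1) * block]) / block
        for i in range(target_len)
    ]


def placeholder_stage(data):
    """Hypothetical stand-in for a pipeline stage such as embedding or projection."""
    return sorted(data)


def run_scalability_log(series, lengths, stage=placeholder_stage):
    """Time one stage on progressively larger resampled subsets of the series."""
    log = []
    for n in lengths:
        subset = resample_to_length(series, n)
        start = time.perf_counter()
        stage(subset)
        elapsed = time.perf_counter() - start
        log.append({"length": len(subset), "seconds": elapsed})
    return log


# Synthetic data only; the paper uses the Solar Power dataset.
rng = random.Random(0)
full_series = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
log = run_scalability_log(full_series, [1_000, 10_000, 100_000])
for entry in log:
    print(f"length={entry['length']:>7}  seconds={entry['seconds']:.6f}")
```

Collecting one log entry per series length yields the raw data for plotting execution time against length, which is the kind of analysis the abstract refers to as log analytics.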
