Automating Data Science Pipelines with Tensor Completion

Shaan Pakala; Bryce Graw; Dawon Ahn; Tam Dinh; Mehnaz Tabassum Mahin; Vassilis Tsotras; Jia Chen; Evangelos E. Papalexakis

TL;DR

Tensor completion is shown to be an effective tool for automating data science pipelines, through experiments on hyperparameter optimization for non-neural network models, neural architecture search, and variants of query cardinality estimation.

Abstract

Hyperparameter optimization is an essential component of many data science pipelines and typically entails exhaustive, time- and resource-consuming computations to explore the combinatorial search space. Other key operations in data science pipelines exhibit the same properties. Important examples are neural architecture search, where the goal is to identify the best design choices for a neural network, and query cardinality estimation, where, given different predicate values for a SQL query, the goal is to estimate the size of the output. In this paper, we abstract away those essential components of data science pipelines and model them as instances of tensor completion: each variable of the search space corresponds to one mode of the tensor, and the goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values, starting from a very small sample of observed entries. To do so, we first conduct a thorough experimental evaluation of existing state-of-the-art tensor completion techniques, and we introduce domain-inspired adaptations (such as smoothness across the discretized variable space) and an ensemble technique that achieves state-of-the-art performance. We extensively evaluate existing and proposed methods on a number of datasets generated for (a) hyperparameter optimization for non-neural network models, (b) neural architecture search, and (c) variants of query cardinality estimation, demonstrating the effectiveness of tensor completion as a tool for automating data science pipelines. Furthermore, we release our generated datasets and code to provide benchmarks for future work on this topic.
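To make the formulation in the abstract concrete, here is a minimal sketch of tensor completion with a plain CP (CANDECOMP/PARAFAC) model fitted by masked alternating least squares. It is not the authors' method or code: it omits their smoothness constraint and ensemble, and the function name `cp_complete`, the ridge term, the toy 6 x 5 x 4 grid, and the 50% sampling rate are all assumptions chosen for illustration. Each tensor mode stands for one search-space variable, and each entry for the metric of one configuration.

```python
import numpy as np

def cp_complete(tensor, mask, rank=2, ridge=1e-3, iters=50, seed=0):
    """Fill in a 3-mode tensor from its observed entries with a rank-`rank` CP model.

    tensor: array (I, J, K); values where mask is False are ignored.
    mask:   boolean array (I, J, K); True marks an observed entry.
    """
    rng = np.random.default_rng(seed)
    # One factor matrix per mode, i.e. per search-space variable.
    factors = [rng.uniform(0.5, 1.5, size=(n, rank)) for n in tensor.shape]
    for _ in range(iters):
        for mode in range(3):
            # Bring the mode being updated to the front and flatten the other two.
            X = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
            M = np.moveaxis(mask, mode, 0).reshape(tensor.shape[mode], -1)
            P, Q = [factors[m] for m in range(3) if m != mode]
            # Khatri-Rao product: row (p, q) holds P[p] * Q[q], in the same order as the flattening.
            Z = (P[:, None, :] * Q[None, :, :]).reshape(-1, rank)
            for i in range(X.shape[0]):
                Zi = Z[M[i]]  # design rows for the observed entries of slice i
                lhs = Zi.T @ Zi + ridge * np.eye(rank)
                factors[mode][i] = np.linalg.solve(lhs, Zi.T @ X[i, M[i]])
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Hypothetical toy search space: three discretized variables with 6, 5 and 4 values.
a = np.linspace(1.0, 2.0, 6)
b = np.linspace(0.5, 1.5, 5)
c = np.linspace(1.0, 3.0, 4)
truth = np.einsum("i,j,k->ijk", a, b, c)  # synthetic metric tensor, exactly rank 1

rng = np.random.default_rng(1)
mask = rng.random(truth.shape) < 0.5      # pretend only about half of the configurations were run
estimate = cp_complete(np.where(mask, truth, 0.0), mask, rank=1)

observed_mae = np.abs(estimate - truth)[mask].mean()
heldout_mae = np.abs(estimate - truth)[~mask].mean()
print(f"observed MAE = {observed_mae:.4f}, held-out MAE = {heldout_mae:.4f}")
```

The held-out MAE measures how well the unobserved configurations are recovered, which is the quantity of interest when the completed tensor serves as a surrogate for the full search space. The synthetic tensor here is exactly low rank by construction; real metric tensors are not, so this toy says nothing about the accuracy reported in the paper.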

Paper Structure (33 sections, 5 equations, 8 figures, 6 tables)

This paper contains 33 sections, 5 equations, 8 figures, 6 tables.

Introduction
Preliminaries and Problem Formulation
Preliminaries
Data Science Pipelines
Surrogate Modeling
Tensors
Tensor Decomposition
Tensor Completion
Tensor Completion Training
Problem Definition
Tensor Completion for Hyperparameter Tuning
Tensor Completion for Neural Architecture Search
Tensor Completion for Query Cardinality Estimation
Tensor Completion for Query Distinct Cardinality Estimation
Proposed Method
...and 18 more sections

Figures (8)

Figure 1: In this work we unify a number of combinatorial and highly computationally intense data science tasks, such as hyperparameter optimization, neural architecture search, and query cardinality estimation, under the umbrella of tensor completion. We conduct a thorough and extensive study of existing tensor completion methods and propose a novel method for accurately recovering the entire search space in those data science tasks from a small number of observations, towards automating data science pipelines.
Figure 2: Indicative tensor slices where the nature of the hyperparameters involved results in smoothness across those dimensions, motivating our proposed smoothness-constrained CPD-S method.
Figure 3: Sparse Tensor Completion MAE using 5% observed values. Each cell represents average MAE over 5 iterations.
Figure 4: Error vs. levels of sparsity for various models across all four tasks. CoSTCo [liu2019costco] and its ensemble consistently perform the best with few observed entries. Closer to 5% observed entries, the rest of the models seem to catch up. Our proposed CPD-S and $\text{TenSemble-CPD-S}_{\text{MLP}}$, however, perform similarly to CoSTCo [liu2019costco] without using as many parameters as a CNN.
Figure 5: CPD-S error for different values of the lambda coefficient, compared with regular CPD and Naive methods. These graphs show that a positive lambda value (enforcing the smoothness constraint) almost always decreases the error in our application.
...and 3 more figures
