Privacy-Preserving Sharing of Data Analytics Runtime Metrics for Performance Modeling

Jonathan Will; Dominik Scheinert; Jan Bode; Cedric Kring; Seraphin Zunzer; Lauritz Thamsen

TL;DR

This work presents a privacy-preserving approach, based on differential privacy and data synthesis, for sharing runtime metrics between collaborating organizations. The evaluation indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when little original data is available.

Abstract

Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations. Yet, not all organizations may be inclined to publicly disclose such metadata. We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when there is minimal original data available. With 30 or fewer available original data samples, the use of synthetic training data resulted in only a one percent reduction in performance model accuracy on average.
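To make the differential privacy ingredient mentioned in the abstract more concrete, the following is a minimal sketch of the textbook Laplace mechanism applied to the mean of a few job runtimes. It is not the paper's synthesis pipeline, which this summary does not detail. The function names (`laplace_noise`, `dp_mean`), the clipping bounds, the privacy budget `epsilon`, and the sample runtimes are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng):
    """Release an epsilon-differentially private mean (illustrative only).

    Values are clipped to [lower, upper] so that changing one record can
    shift the mean by at most (upper - lower) / n, which is the sensitivity
    used to calibrate the noise. The record count n is treated as public.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical runtimes in seconds; not taken from the paper's dataset.
    runtimes = [120.0, 95.5, 210.0, 150.25]
    rng = random.Random(42)
    print(dp_mean(runtimes, lower=0.0, upper=600.0, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives stronger privacy but a noisier statistic. With only a handful of records the noise can dominate the released value, which is one reason to release a synthetic dataset fitted under a privacy budget rather than individual noisy statistics.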

Paper Structure (16 sections, 4 figures)

Introduction
Related Work
Dataflow Job Performance Modeling
Privacy in Collaborative Machine Learning
1. Aggregation
2. Encryption
3. Obfuscation
Approach
Idea Overview
Data Obfuscation via Data Synthesis
Evaluation
Experimental Setup
Performance Modeling with Synthetic Data
Data Synthesis Overhead
Discussion
...and 1 more sections

Figures (4)

Figure 1: High-level overview of privacy-preserving runtime metrics sharing for collaborative performance modeling.
Figure 2: Synthetic training dataset size and resulting performance model error compared to using the full original data.
Figure 3: Performance model error with 1000 synthetic data samples, generated from small amounts of original data.
Figure 4: Overhead for creating synthetic data for different Spark job performance datasets.
