Driving Privacy Forward: Mitigating Information Leakage within Smart Vehicles through Synthetic Data Generation

Krish Parikh

TL;DR

The paper addresses privacy risks in smart vehicles by proposing synthetic data as a privacy-preserving proxy. It develops a 14-signal information leakage taxonomy, selects a Tabular Variational Autoencoder (TVAE) to generate synthetic GPS-related data from the Passive Vehicular Sensor (PVS) dataset, and evaluates fidelity, utility, and privacy. Results show high fidelity (~90%) between real and synthetic data, but a notable drop in downstream utility when models are trained on synthetic data (~77.8% accuracy on real data) compared with real-data baselines (~96.9%). Importantly, synthetic data obscures exact routes and precise locations, reducing re-identification risk, though with an acknowledged privacy-utility trade-off. The work highlights the potential and limitations of synthetic data for enabling smart-vehicle research while safeguarding driver privacy, and suggests hybrid approaches as promising future directions.

Abstract

Smart vehicles produce large amounts of data, much of which is sensitive and at risk of privacy breaches. As attackers increasingly exploit anonymised metadata within these datasets to profile drivers, it is important to find solutions that mitigate this information leakage without hindering innovation and ongoing research. Synthetic data has emerged as a promising tool to address these privacy concerns, as it replicates real-world data relationships while minimising the risk of revealing sensitive information. In this paper, we examine the use of synthetic data to tackle these challenges. We start by proposing a comprehensive taxonomy of 14 in-vehicle sensors, identifying potential attacks and categorising their vulnerability. We then focus on the most vulnerable signals, using the Passive Vehicular Sensor (PVS) dataset, which includes over 1 million data points, to generate synthetic data with a Tabular Variational Autoencoder (TVAE) model. Finally, we evaluate the synthetic data against three core metrics: fidelity, utility, and privacy. Our results show that the synthetic data achieved 90.1% statistical similarity to the real data and 78% classification accuracy on the dataset's original classification task, while also preventing the profiling of the driver. The code can be found at https://github.com/krish-parikh/Synthetic-Data-Generation
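The utility comparison summarised above (a model trained on synthetic data and scored on real data, set against a real-data baseline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy one-feature data, the nearest-centroid classifier, and the helper names `make_rows`, `fit_centroids`, and `accuracy` are hypothetical and are not taken from the paper or its repository. The noisier `synthetic_train` set is only a stand-in for generator output, and the sketch is not expected to reproduce the reported accuracies.

```python
import random
from statistics import mean


def make_rows(n, noise, seed):
    """Toy two-class data (hypothetical): one feature, class 1 centred at 2.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(2.0 * label, noise)
        rows.append((feature, label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {c: mean(x for x, y in rows if y == c) for c in (0, 1)}


def accuracy(centroids, rows):
    """Fraction of rows whose nearest class centroid matches the true label."""
    hits = sum(
        1
        for x, y in rows
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return hits / len(rows)


# Real data is split into train and test; the synthetic set stands in for
# samples drawn from a generator fitted on the real training data.
real_train = make_rows(2000, noise=0.5, seed=0)
real_test = make_rows(1000, noise=0.5, seed=1)
synthetic_train = make_rows(2000, noise=1.5, seed=2)

# Baseline: train on real, test on real.
trtr = accuracy(fit_centroids(real_train), real_test)
# Utility check: train on synthetic, test on the same real held-out set.
tstr = accuracy(fit_centroids(synthetic_train), real_test)

print(f"train real, test real:      {trtr:.3f}")
print(f"train synthetic, test real: {tstr:.3f}")
```

Both models are scored on the same held-out real rows, so any gap between the two numbers reflects what the synthetic training set lost relative to the real one.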

Paper Structure

This paper contains 38 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The synthetic data retains the structure of the original data but is not identical to it.
  • Figure 2: Data flows from within the vehicle system to various entities such as manufacturers, service providers, and third parties (caltrider2023).
  • Figure 3: Typical data augmentation example using rotation, reflection, and translation of an image (ubiai2023).
  • Figure 4: Available methods, models, and use cases for generating synthetic data.
  • Figure 5: TVAE training loss over 200 epochs.
  • ...and 5 more figures