A Systematic Evaluation of Generative Models on Tabular Transportation Data

Chengen Wang, Alvaro Cardenas, Gurcan Comert, Murat Kantarcioglu

TL;DR

This work tackles privacy-utility trade-offs in sharing large-scale transportation data by systematically evaluating common tabular generative models on NYC taxi data. It introduces two novel metrics—a graph-based transportation-network similarity and a percentile-based privacy leakage ratio (rDCR)—to capture structural fidelity and privacy risk beyond standard metrics. The experiments show TabDDPM generally offers the best overall performance, though it can struggle with high-cardinality categorical features and may exhibit mode collapse, underscoring the need for domain-tailored models. Overall, the paper argues for transportation-aware generative modeling and broader datasets to safely enable data sharing for planning and policy insights.

Abstract

The sharing of large-scale transportation data is beneficial for transportation planning and policymaking. However, it also raises significant security and privacy concerns, as the data may include identifiable personal information, such as individuals' home locations. To address these concerns, synthetic data generation based on real transportation data offers a promising solution that allows privacy protection while potentially preserving data utility. Although there are various synthetic data generation techniques, they are often not tailored to the unique characteristics of transportation data, such as the inherent structure of transportation networks formed by all trips in the datasets. In this paper, we use New York City taxi data as a case study to conduct a systematic evaluation of the performance of widely used tabular data generative models. In addition to traditional metrics such as distribution similarity, coverage, and privacy preservation, we propose a novel graph-based metric tailored specifically for transportation data. This metric evaluates the similarity between real and synthetic transportation networks, providing potentially deeper insights into their structural and functional alignment. We also introduce an improved privacy metric to address the limitations of the commonly used one. Our experimental results reveal that existing tabular data generative models often fail to perform as consistently as claimed in the literature, particularly when applied to transportation data use cases. Furthermore, our novel graph metric reveals a significant gap between synthetic and real data. This work underscores the potential need to develop generative models specifically tailored to take advantage of the unique characteristics of emerging domains, such as transportation.

Paper Structure

This paper contains 26 sections, 10 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Privacy leakage assessment based on rDCR, i.e., DCR ratio $rs/hs$.
  • Figure 2: The complexity of the generative models in terms of running time in minutes. All the models are tested on 40,000 samples.