Synthetic location trajectory generation using categorical diffusion models

Simon Dirmeier, Ye Hong, Fernando Perez-Cruz

TL;DR

This work introduces a categorical diffusion model that operates in a continuous latent space to generate synthetic individual location trajectories (ILTs) from GNSS data. The approach maps discrete location sequences to continuous embeddings, applies diffusion in that space, and decodes the result back to discrete locations, which enables both conditional (infilling) and unconditional synthesis. Experimental results on GC GNSS data show that conditionally generated ILTs replicate key statistics such as entropy and visit counts, while unconditional generation yields similar entropy with some biases in distance distributions. The proposed method offers a privacy-friendly tool for benchmarking mobility methodologies and assessing synthetic data quality in mobility research.

Abstract

Diffusion probabilistic models (DPMs) have rapidly evolved into one of the predominant classes of generative models for the simulation of synthetic data, for instance, in computer vision, audio, natural language processing, and biomolecule generation. Here, we propose using DPMs for the generation of synthetic individual location trajectories (ILTs), which are sequences of variables representing physical locations visited by individuals. ILTs are of major importance in mobility research for understanding the mobility behavior of populations and ultimately informing political decision-making. We represent ILTs as multi-dimensional categorical random variables and propose to model their joint distribution using a continuous DPM by first applying the diffusion process in a continuous unconstrained space and then mapping the continuous variables into a discrete space. We demonstrate that our model can synthesize realistic ILTs by comparing conditionally and unconditionally generated sequences to real-world ILTs from a GNSS tracking data set. This suggests that our model could be used for synthetic data generation, for example, for benchmarking models used in mobility research.
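The embed, diffuse, and decode pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random embedding table, the single-step Gaussian noising with a hand-picked `alpha_bar`, and the nearest-neighbour decoding are all assumptions made for illustration, and no score model is trained or used here.

```python
import math
import random


def make_embeddings(num_locations, dim, rng):
    """Hypothetical embedding table: one random vector per discrete location."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_locations)]


def embed(trajectory, table):
    """Map a sequence of location indices to continuous latent vectors."""
    return [list(table[loc]) for loc in trajectory]


def add_noise(latents, alpha_bar, rng):
    """Gaussian forward noising: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in z] for z in latents]


def decode(latents, table):
    """Assumed decoding rule: pick the location whose embedding is closest."""
    def nearest(z):
        return min(
            range(len(table)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(z, table[k])),
        )
    return [nearest(z) for z in latents]


rng = random.Random(0)
table = make_embeddings(num_locations=5, dim=16, rng=rng)
trajectory = [0, 3, 3, 1, 4, 0]          # toy ILT over 5 locations

clean = embed(trajectory, table)          # discrete -> continuous
noisy = add_noise(clean, alpha_bar=0.5, rng=rng)
roundtrip = decode(clean, table)          # continuous -> discrete, no noise
decoded_noisy = decode(noisy, table)      # may differ from the input
```

Decoding the clean latents recovers the input sequence, while decoding noisy latents may not. In a full model, a learned denoiser would sit between the noising and decoding steps.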

Paper Structure (26 sections, 15 equations, 4 figures)

Figures (4)

  • Figure 1: Ablation study. We evaluate the influence of embedding dimensionality, score model parameterization, and self-conditioning on the objective function (Equation \ref{eqn:full-continuous-elbo}) and show the value of the objective on a validation set relative to the best parameterization (i.e., the best model has a relative ELBO of $1$; all others are higher). We find that an embedding dimensionality of $16$ with $\bm{z}_0$-parameterization and self-conditioning yields the best results on our data set, and that the embedding dimensionality has only minimal influence on performance with a $\bm{z}_0$-parameterization in comparison to the ${\boldsymbol \epsilon}_t$-parameterization (cf. li2022diffusion).
  • Figure 2: Empirical comparison of statistics of observational and synthesized location sequences. We compute the entropy, the number of visits per location and travel distances over each location trajectory and visualize the histogram of these statistics (the greater the overlap the better). The first four columns in each figure show mechanistic models, the last two columns show CDPM simulations with conditional and unconditional synthesis, respectively.
  • Figure 3: Model architecture.
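The per-trajectory statistics named in the Figure 2 caption (entropy, visits per location, travel distances) could be computed along the following lines. The exact definitions used in the paper are not given here, so this sketch assumes Shannon entropy of the empirical visit frequencies and Euclidean distance between consecutive locations on planar coordinates. The function names and the `coords` lookup are hypothetical.

```python
import math
from collections import Counter


def location_entropy(trajectory):
    """Shannon entropy (in nats) of the empirical visit frequencies."""
    counts = Counter(trajectory)
    n = len(trajectory)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def visit_counts(trajectory):
    """Number of visits per location index."""
    return dict(Counter(trajectory))


def travel_distances(trajectory, coords):
    """Distances between consecutive locations.

    Assumes planar (x, y) coordinates; raw GNSS latitude/longitude
    would need a geodesic distance instead.
    """
    return [
        math.dist(coords[a], coords[b])
        for a, b in zip(trajectory, trajectory[1:])
    ]


# Toy example with two hypothetical locations.
coords = {0: (0.0, 0.0), 1: (3.0, 4.0)}
ilt = [0, 1, 1, 0]
stats = {
    "entropy": location_entropy(ilt),
    "visits": visit_counts(ilt),
    "distances": travel_distances(ilt, coords),
}
```

Computing these statistics for each real and each synthetic trajectory gives the samples whose histograms would then be compared for overlap.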