Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation

Zhiyang Zhang, Ningcong Chen, Xin Zhang, Yanhua Li, Shen Su, Hui Lu, Jun Luo

TL;DR

The paper tackles GPS trajectory generation with diffusion models, addressing the loss of street-level detail seen in convolution-based approaches. It introduces Traj-Transformer, a Transformer-based diffusion model that supports two GPS point embeddings (loc-emb and lon-lat-emb) and uses adaptive layer norm (adaLN) to condition noise prediction on diffusion timesteps and auxiliary information. Empirical results on Chengdu and Xi’an show that lon-lat-emb combined with larger Transformer models yields higher trajectory fidelity and less deviation than UNet-based baselines, without requiring road-network data. The work demonstrates that Transformer architectures can generate high-quality trajectories without road-network input, and it suggests avenues for fully Transformer-based end-to-end pipelines, including integration with RoadMAE for conditioning.

Abstract

The widespread use of GPS devices has driven advances in spatiotemporal data mining, enabling machine learning models to simulate human decision making and generate realistic trajectories, addressing both data collection costs and privacy concerns. Recent studies have shown the promise of diffusion models for high-quality trajectory generation. However, most existing methods rely on convolution-based architectures (e.g., UNet) to predict noise during the diffusion process, which often results in notable deviations and the loss of fine-grained street-level details due to limited model capacity. In this paper, we propose Trajectory Transformer (Traj-Transformer), a novel model that employs a transformer backbone for both conditional information embedding and noise prediction. We explore two GPS coordinate embedding strategies, location embedding and longitude-latitude embedding, and analyze model performance at different scales. Experiments on two real-world datasets demonstrate that Trajectory Transformer significantly enhances generation quality and effectively alleviates the deviation issues observed in prior approaches.
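
The abstract refers to predicting noise during the diffusion process without stating the objective. For orientation, here is the standard DDPM noise-prediction formulation; it is assumed here, not taken from the paper's own equations, which this page does not reproduce. The symbols are illustrative: $x_0$ is a clean trajectory of GPS points, $\beta_t$ is the noise schedule, $c$ is the conditioning information, and $\epsilon_\theta$ is the noise-prediction network (a UNet in prior work, a transformer here).

```latex
% Forward (noising) process in closed form, with \bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Training objective: regress the injected noise from the noisy trajectory
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, t,\, \epsilon}
\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right]
```

Under this formulation, the choice of backbone only changes how $\epsilon_\theta$ is parameterized; the forward process and loss stay the same.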

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Trajectories generated in a high-density urban region, using models that are trained directly on raw GPS trajectories without road-network data. (A): Convolution-based models struggle to reconstruct the street structure. (B): Our model preserves fine-grained, street-level details, leading to significantly improved generation quality. (C): Raw GPS trajectories collected by GPS devices.
  • Figure 2: Architecture of the Trajectory Transformer (Traj-Transformer). The model takes GPS trajectories as input and supports two alternative embedding strategies for GPS points: (1) loc-emb, which computes an embedding for each location, and (2) lon-lat-emb, which independently embeds longitude and latitude coordinates. These embeddings are then fed into a Transformer backbone, which serves as the core of our model. To enable conditional generation, both the generation conditions and diffusion timesteps are injected into the transformer layers using an adaptive layer norm (adaLN). After passing through the decoder, the model produces noise predictions that are used in the diffusion reverse process to denoise.
  • Figure 3: Density error over training steps on Chengdu. Models using lon-lat embeddings consistently outperform those with loc embeddings throughout all training stages. Similar trends are observed when training on Xi’an.
  • Figure 4: Visualizations of two cities (top: Chengdu; bottom: Xi'an) from different models. Red rectangles indicate high-density urban regions where performance differences among models are most pronounced. To facilitate direct comparison, each highlighted patch presents a magnified view of the corresponding region generated by different models. Lower-performing models tend to overlook fine-grained, street-level structures, whereas higher-performing models more accurately capture and preserve these intricate details. All visualizations depict 5,000 trajectories, rendered without any visual post-processing.
  • Figure 5: Visualizations of two cities (top: Chengdu; bottom: Xi'an) from different models, alongside the original trajectories from the test set (last column). Each subplot displays 5,000 trajectories generated from the test set without any visual post-processing. Red rectangles highlight high-density urban regions where differences in model performance are most evident. Lower-performing models struggle to preserve fine-grained street-level details in these areas, while higher-performing models more accurately capture and maintain the underlying street structures.
  • ...and 1 more figures
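
The Figure 2 caption describes two embedding strategies and adaLN conditioning. The sketch below illustrates those ideas in plain Python on a single GPS point. It is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the dimensions, the use of linear projections, the concatenation in `lon_lat_emb`, and the DiT-style modulation `norm(x) * (1 + scale) + shift` are hypothetical choices that the caption does not specify.

```python
import math
import random

def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def loc_emb(point, params):
    """loc-emb sketch: embed the (lon, lat) pair jointly as one location."""
    return linear(list(point), *params)

def lon_lat_emb(point, lon_params, lat_params):
    """lon-lat-emb sketch: embed longitude and latitude independently.
    Concatenating the two halves is an assumption, not the paper's stated rule."""
    lon, lat = point
    return linear([lon], *lon_params) + linear([lat], *lat_params)

def layer_norm(x, eps=1e-6):
    """Normalise a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_ln(x, cond, scale_params, shift_params):
    """adaLN sketch: normalise x, then scale and shift it with values
    regressed from the conditioning vector (timestep + generation conditions)."""
    scale = linear(cond, *scale_params)
    shift = linear(cond, *shift_params)
    return [n * (1.0 + s) + t for n, s, t in zip(layer_norm(x), scale, shift)]

rng = random.Random(0)
d_model, d_cond = 8, 4
point = (0.25, -0.5)  # normalised (lon, lat); hypothetical values

token_loc = loc_emb(point, init_linear(2, d_model, rng))
token_lonlat = lon_lat_emb(
    point,
    init_linear(1, d_model // 2, rng),
    init_linear(1, d_model // 2, rng),
)

# Stand-in for the combined timestep and condition embedding.
cond = [rng.uniform(-1.0, 1.0) for _ in range(d_cond)]
modulated = ada_ln(
    token_lonlat,
    cond,
    init_linear(d_cond, d_model, rng),
    init_linear(d_cond, d_model, rng),
)
```

With the scale and shift projections set to zero, `ada_ln` reduces to a plain layer norm, which makes the role of the conditioning explicit: it only modulates the normalised activations inside each transformer layer and does not enter the token sequence itself.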