GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning

Umang Garg, Bowen Zhang, Anantajit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath

TL;DR

GPS-MTM learns patterns of normal human mobility from unlabeled GPS traces. It decomposes trajectories into states and actions and trains a bi-directional Transformer with a self-supervised masked reconstruction objective, optimizing $P(\mathcal{T}_{\text{mask}} \mid \mathcal{T}_{\text{obs}})$, the probability of the masked tokens given the observed ones. It introduces a multi-modal state–action representation and a joint POI classification and detail regression objective, and reports improvements on trajectory infilling and next-stop prediction across four datasets, including Geolife. The results show that the model handles long-range dependencies and preserves rare-location distributions (Bias Ratio near 1) even in large POI spaces. The approach positions mobility data as a first-class modality for foundation-model-scale representation learning, with practical implications for urban analytics, anomaly detection, and synthetic trajectory generation.

Abstract

Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPS-Masked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms baselines on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.

Paper Structure

This paper contains 13 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Multimodal representation of mobility trajectories. Daily activities are expressed as states (dwelling at specific locations) and actions (transitions between them). Our model learns to reconstruct missing components, aligning predicted transitions with ground truth movement patterns.
  • Figure 2: Multi-modal trajectory representation and pre-training framework. (A) Token structure with category and feature components, each enhanced with positional and modality embeddings. (B) Masked token reconstruction during pre-training, where a subset of the L input tokens is randomly masked and the model learns to reconstruct the missing tokens from the remaining context. (C) Four downstream tasks used for evaluation: [1] goal prediction, [2] random masking, [3] forward dynamics, and [4] inverse dynamics.
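
The masking patterns named in Figure 2 can be illustrated with a small sketch. This is not the authors' code: the interleaved state/action token layout, the `<mask>` placeholder, the example labels, and every function name below are assumptions made for illustration, and the position rules for goal prediction, forward dynamics, and inverse dynamics are only one plausible reading of those task names.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden token value


def interleave(states, actions):
    """Build [s0, a0, s1, a1, ..., sN] as (modality, value) pairs (assumed layout)."""
    assert len(states) == len(actions) + 1
    tokens = []
    for i, state in enumerate(states):
        tokens.append(("state", state))
        if i < len(actions):
            tokens.append(("action", actions[i]))
    return tokens


def apply_mask(tokens, positions):
    """Hide the value at each position but keep its modality tag."""
    hidden = set(positions)
    return [(m, MASK) if i in hidden else (m, v) for i, (m, v) in enumerate(tokens)]


def random_positions(n_tokens, mask_ratio, rng):
    """Random masking: hide a random subset of the L tokens."""
    n_mask = max(1, round(mask_ratio * n_tokens))
    return sorted(rng.sample(range(n_tokens), n_mask))


def goal_positions(tokens):
    """Goal prediction (assumed): hide only the final state."""
    return [len(tokens) - 1]


def inverse_dynamics_positions(tokens):
    """Inverse dynamics (assumed): hide every action, keep every state."""
    return [i for i, (m, _) in enumerate(tokens) if m == "action"]


def forward_dynamics_positions(tokens, t):
    """Forward dynamics (assumed): hide everything after action t."""
    return list(range(2 * t + 2, len(tokens)))


# Made-up example trajectory: three stops joined by two transitions.
tokens = interleave(["home", "cafe", "office"], ["walk", "drive"])
rng = random.Random(0)
print(apply_mask(tokens, random_positions(len(tokens), 0.4, rng)))
print(apply_mask(tokens, inverse_dynamics_positions(tokens)))
```

In this sketch, all four tasks reduce to choosing which positions to hide, so a single reconstruction model could in principle be queried for each of them by changing only the mask.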