GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning
Umang Garg, Bowen Zhang, Anantajit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath
TL;DR
GPS-MTM tackles learning patterns of normal human mobility from unlabeled GPS traces by decomposing trajectories into states and actions and training a self-supervised masked reconstruction objective with a bi-directional Transformer, optimizing $P(\mathcal{T}_{mask} \mid \mathcal{T}_{obs})$. It introduces a multi-modal state–action representation, a joint POI classification and detail regression objective, and demonstrates robust improvements on trajectory infilling and next-stop prediction across four datasets, including Geolife. The results show the model handles long-range dependencies and preserves rare-location distributions (Bias Ratio near 1) even in large POI spaces. The approach positions mobility data as a first-class modality for foundation-model-scale representation learning, with practical implications for urban analytics, anomaly detection, and synthetic trajectory generation.
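One common way to make the objective $P(\mathcal{T}_{mask} \mid \mathcal{T}_{obs})$ concrete, as in BERT-style masked modeling, is to assume the masked tokens are predicted independently given the observed context and to minimize the negative log-likelihood over the masked positions. The index set $M$, the token symbol $\tau_i$, and the parameters $\theta$ are notation introduced here for illustration and are not taken from the paper; the paper's actual loss (which combines POI classification with detail regression) may differ.

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{\mathcal{T},\,M}\Big[\sum_{i \in M} \log P_\theta\big(\tau_i \mid \mathcal{T}_{obs}\big)\Big], \qquad \mathcal{T}_{mask} = \{\tau_i : i \in M\}.
$$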
Abstract
Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPS-Masked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms baselines on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. The code is released for reference.
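The masking idea behind the state and action decomposition can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mask_trajectory` and `next_stop_mask`, the `<mask>` placeholder, the mask ratio, and the toy POI and transition labels are all assumptions made for illustration. The sketch hides a random subset of tokens in each modality and records the hidden values as reconstruction targets; next-stop prediction is then treated as the special case where only the final state is hidden.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def mask_trajectory(states, actions, mask_ratio=0.5, seed=0):
    """Hide a random fraction of state and action tokens independently.

    Returns the observed (partially masked) sequences and a dict of
    reconstruction targets keyed by modality and position.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    rng = random.Random(seed)
    n = len(states)
    k = max(1, round(mask_ratio * n))  # number of tokens hidden per modality
    state_idx = set(rng.sample(range(n), k))
    action_idx = set(rng.sample(range(n), k))
    obs_states = [MASK if i in state_idx else s for i, s in enumerate(states)]
    obs_actions = [MASK if i in action_idx else a for i, a in enumerate(actions)]
    targets = {
        "states": {i: states[i] for i in sorted(state_idx)},
        "actions": {i: actions[i] for i in sorted(action_idx)},
    }
    return obs_states, obs_actions, targets


def next_stop_mask(states, actions):
    """Assumed framing of next-stop prediction: hide only the final state."""
    last = len(states) - 1
    obs_states = list(states[:-1]) + [MASK]
    targets = {"states": {last: states[last]}, "actions": {}}
    return obs_states, list(actions), targets


# Toy trajectory: POI categories (states) and transitions (actions).
states = ["home", "cafe", "office", "gym"]
actions = ["walk", "drive", "walk", "stay"]
obs_s, obs_a, tgt = mask_trajectory(states, actions, mask_ratio=0.5, seed=0)
```

A model trained on such pairs would receive `obs_s` and `obs_a` as input and be scored only on the positions listed in `tgt`, which is what lets the same network serve infilling and next-stop prediction by changing the mask pattern.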
