Multimodal Trajectory Representation Learning for Travel Time Estimation
Zhi Liu, Xuyuan Hu, Xiao Han, Zhehao Dai, Zhaolin Deng, Guojiang Shen, Xiangjie Kong
TL;DR
MDTI tackles travel time estimation (TTE) under multimodal, noisy trajectory data by learning unified representations from three modalities: GPS sequences ($T^{gps}$), grid trajectories ($T^{grid}$), and road network constraints ($T^{road}$). It employs modality-specific encoders and a cross-modal interactor, coupled with a dynamic trajectory modeling module that adapts information density to trajectories of varying lengths. It is trained with two self-supervised objectives: a contrastive alignment loss and a masked language modeling loss. The approach yields state-of-the-art results on three real-world datasets, demonstrating robust cross-modal fusion and length-adaptive representation learning for TTE. The framework has practical implications for routing, traffic management, and intelligent transportation systems, where it can improve prediction accuracy and generalization under multimodal, real-world conditions.
Abstract
Accurate travel time estimation (TTE) plays a crucial role in intelligent transportation systems. However, it remains challenging due to heterogeneous data sources and complex traffic dynamics. Moreover, conventional approaches typically convert trajectories into fixed-length representations, neglecting the inherent variability of real-world trajectories, which often leads to information loss or feature redundancy. To address these challenges, this paper introduces the Multimodal Dynamic Trajectory Integration (MDTI) framework, a novel multimodal trajectory representation learning approach that integrates GPS sequences, grid trajectories, and road network constraints to enhance TTE accuracy. MDTI employs modality-specific encoders and a cross-modal interaction module to capture complementary spatial, temporal, and topological semantics, while a dynamic trajectory modeling mechanism adaptively regulates information density for trajectories of varying lengths. Two self-supervised pretraining objectives, namely contrastive alignment and masked language modeling, further strengthen multimodal consistency and contextual understanding. Extensive experiments on three real-world datasets demonstrate that MDTI consistently outperforms state-of-the-art baselines, confirming its robustness and strong generalization ability. The code is publicly available at: https://github.com/freshhxy/MDTI/
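
The abstract names contrastive alignment as a pretraining objective but does not give its form. As a rough illustration only, the sketch below uses a symmetric InfoNCE loss, a common choice for aligning two modalities; it is an assumption, not the formulation used in MDTI. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical. Row i of each input is taken to be the embedding of the same trajectory in two different modalities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_logits(emb_a, emb_b, temperature):
    """Pairwise cosine similarities between two batches, divided by temperature."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs (i, i) are positives, all others negatives.

    Assumption: this is a generic loss, not necessarily the one used by MDTI.
    """
    logits_ab = cosine_logits(emb_a, emb_b, temperature)
    logits_ba = [list(col) for col in zip(*logits_ab)]  # transpose for b -> a
    return 0.5 * (cross_entropy_diagonal(logits_ab) + cross_entropy_diagonal(logits_ba))


# Toy batch of two trajectories with 2-dimensional embeddings (hypothetical values).
gps_emb = [[1.0, 0.0], [0.0, 1.0]]
road_emb_aligned = [[1.0, 0.0], [0.0, 1.0]]
road_emb_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_alignment_loss(gps_emb, road_emb_aligned)
swapped_loss = contrastive_alignment_loss(gps_emb, road_emb_swapped)
```

In this toy batch the loss is close to zero when each trajectory's two embeddings agree and large when they are swapped, which is the behavior an alignment objective is meant to reward. A real training setup would compute the same quantity on encoder outputs inside an automatic differentiation framework.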
