Timeseries Foundation Models for Mobility: A Benchmark Comparison with Traditional and Deep Learning Models
Anita Graser
TL;DR
The paper benchmarks time-series foundation models, particularly TimeGPT, for mobility forecasting against traditional ARIMA/SARIMA and deep-learning baselines on the BikeNYC and BikeVIE datasets at forecast horizons of 1, 12, and 24 hours. On BikeNYC, TimeGPT delivers strong 1-hour forecasts, but its advantage diminishes at longer horizons, where a Seasonal Naive baseline can perform better. On BikeVIE, TimeGPT shows limited gains, with substantial uncertainty at 12 and 24 hours. The author highlights potential leakage of the evaluation data into the foundation model's training data as a major concern for judging generalization, and advocates testing on unseen datasets. The paper concludes that foundation models can be viable in data-sparse mobility scenarios and recommends incorporating exogenous covariates in future work to make predictions more robust.
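The Seasonal Naive baseline forecasts each hour with the value observed at the same hour one season earlier. The sketch below is a minimal illustration that assumes hourly data with a daily season of 24 steps. It also assumes that each horizon is scored with mean absolute error from a single forecast origin. The paper's exact evaluation protocol, metrics, and season length are not stated here, so treat these choices as assumptions. `seasonal_naive`, `mean_absolute_error`, and `evaluate_horizons` are hypothetical helper names, not functions from the paper or a library.

```python
from typing import Dict, List, Sequence


def seasonal_naive(history: Sequence[float], horizon: int,
                   season_length: int = 24) -> List[float]:
    """Repeat the most recent full season to forecast `horizon` steps ahead.

    Step h (1-based) is predicted by the value observed one season earlier;
    for h > season_length the last season is simply cycled again.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if len(history) < season_length:
        raise ValueError("history must contain at least one full season")
    last_season = list(history[-season_length:])
    return [last_season[(h - 1) % season_length] for h in range(1, horizon + 1)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between aligned observations and forecasts."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def evaluate_horizons(series: Sequence[float], cutoff: int,
                      horizons: Sequence[int] = (1, 12, 24),
                      season_length: int = 24) -> Dict[int, float]:
    """MAE of one forecast issued at index `cutoff`, for each horizon length.

    Assumption: a single forecast origin; a rolling-origin evaluation would
    repeat this for many cutoffs and average the results.
    """
    history = list(series[:cutoff])
    results: Dict[int, float] = {}
    for h in horizons:
        forecast = seasonal_naive(history, h, season_length)
        actual = list(series[cutoff:cutoff + h])
        results[h] = mean_absolute_error(actual, forecast)
    return results


if __name__ == "__main__":
    # Hypothetical hourly trip counts: three days with the same daily profile,
    # except that the afternoon of the third day (hours 12-23) is 4 trips higher.
    day = [2, 1, 1, 1, 2, 5, 12, 30, 45, 25, 15, 14,
           16, 15, 14, 18, 28, 40, 33, 20, 12, 8, 5, 3]
    series = day + day + [v + 4 if i >= 12 else v for i, v in enumerate(day)]
    print(evaluate_horizons(series, cutoff=48))  # {1: 0.0, 12: 0.0, 24: 2.0}
```

In this toy series the disturbance only affects hours that the 24-hour horizon reaches. As a result, the error appears only at the longest horizon. Real mobility data would need a rolling-origin evaluation over many cutoffs to give stable comparisons.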
Abstract
Crowd and flow prediction has been studied extensively in mobility data science. Traditional forecasting methods have relied on statistical models such as ARIMA, later supplemented by deep learning approaches like ST-ResNet. More recently, foundation models for time series forecasting, such as TimeGPT, Chronos, and Lag-Llama, have emerged. A key advantage of these models is their ability to generate zero-shot predictions, which allows them to be applied directly to new tasks without retraining. This study compares the performance of TimeGPT with traditional approaches for predicting city-wide mobility time series, using two bike-sharing datasets from New York City and Vienna, Austria. Model performance is assessed across short-term (1-hour), medium-term (12-hour), and long-term (24-hour) forecasting horizons. The results highlight the potential of foundation models for mobility forecasting while also revealing limitations of our experiments.
