Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

TL;DR

This work introduces ST-Vision-LLM, a vision-language framework for 2D spatiotemporal traffic forecasting that processes historical global traffic matrices as image sequences via a Vision-LLM encoder to inform cell-level predictions. It tackles numeric data efficiency with a direct floating-point token vocabulary and a two-stage fine-tuning pipeline (SFT followed by GRPO), enabling accurate per-cell forecasting under a global context. Across long-term, cross-domain, few-shot, and zero-shot benchmarks on Milan and Trentino data, ST-Vision-LLM delivers state-of-the-art results and demonstrates strong generalization and data efficiency, especially in data-scarce environments. The approach offers a practical, scalable paradigm for spatiotemporal prediction in dense urban networks by leveraging vision-language models to capture complex spatial dependencies without heavy graph-based architectures.

Abstract

Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.
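The abstract describes GRPO as a memory-efficient reinforcement learning method. A minimal sketch of its core idea follows: each sampled forecast is scored against the other samples drawn for the same prompt, so no separate value (critic) network is needed. The reward used here (negative mean absolute error) and the function names are illustrative assumptions, not the reward design reported by the authors.

```python
from statistics import mean, pstdev


def negative_mae_reward(predicted, target):
    """Hypothetical reward: closer forecasts get a higher (less negative) score."""
    return -sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    This is the group-relative advantage used by GRPO; the population
    standard deviation is a choice made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt (one target cell), three sampled forecasts of two future steps.
target = [0.50, 0.60]
group = [
    [0.50, 0.60],  # exact forecast
    [0.40, 0.70],  # small error
    [0.90, 0.10],  # large error
]

rewards = [negative_mae_reward(sample, target) for sample in group]
advantages = group_relative_advantages(rewards)

for sample, adv in zip(group, advantages):
    print(sample, round(adv, 3))
```

Samples that beat the group average receive a positive advantage and are reinforced, while those below it are penalized. The policy update itself (clipped objective and KL regularization) is omitted from this sketch.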

Paper Structure

This paper contains 22 sections, 21 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: The ST-Vision-LLM framework. The global historical spatiotemporal traffic data are first normalized and then passed to (1) the image encoder of the multimodal LLM, which produces image-patch encodings that are fed into the LLM's context embedding. In parallel, (2) the target geographic grid, metadata, and task instructions are supplied as text, processed by a text tokenizer and embedding layer, and added to the LLM context. The LLM then performs inference and outputs the predicted future traffic as numerical tokens, where (3) denotes the token IDs output by the LLM and (4) the human-readable labels corresponding to those token IDs. Finally, a numerical token mapper converts the (3) numerical tokens into (5) the final traffic forecast, i.e., the predicted future traffic sequence for the current geographic grid.
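
The data path in Figure 1 can be illustrated with a small sketch: normalize a traffic grid, render it as pixel intensities for an image encoder, and map numerical token IDs back to traffic values. The min-max normalization, the 8-bit grayscale rendering, the three-decimal vocabulary, and the constants `BASE_ID` and `NUM_BINS` are all assumptions made for illustration; the caption does not specify them.

```python
NUM_BINS = 1001   # hypothetical vocabulary: "0.000", "0.001", ..., "1.000"
BASE_ID = 50000   # hypothetical ID of the first numerical token


def min_max_normalize(grid):
    """Scale a 2D traffic grid to [0, 1]; also return the bounds for inversion."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant grid
    return [[(v - lo) / span for v in row] for row in grid], lo, hi


def to_grayscale(norm_grid):
    """Render normalized values as 8-bit pixel intensities (one image per time step)."""
    return [[round(v * 255) for v in row] for row in norm_grid]


def value_to_token_id(value):
    """Map a normalized value to a single numerical token ID."""
    value = min(max(value, 0.0), 1.0)
    return BASE_ID + round(value * (NUM_BINS - 1))


def token_id_to_label(token_id):
    """Human-readable label of a numerical token, item (4) in the caption."""
    return f"{(token_id - BASE_ID) / (NUM_BINS - 1):.3f}"


def token_ids_to_traffic(token_ids, lo, hi):
    """Numerical token mapper: token IDs back to traffic values in original units."""
    return [lo + (t - BASE_ID) / (NUM_BINS - 1) * (hi - lo) for t in token_ids]


grid = [[0.0, 50.0], [100.0, 200.0]]          # toy 2x2 traffic matrix
norm, lo, hi = min_max_normalize(grid)
pixels = to_grayscale(norm)

token_ids = [value_to_token_id(v) for v in norm[0]]   # tokens for the first row
labels = [token_id_to_label(t) for t in token_ids]
recovered = token_ids_to_traffic(token_ids, lo, hi)

print(pixels, token_ids, labels, recovered)
```

With one token per value, a forecast of k future steps costs k output tokens rather than several tokens per number, which is the efficiency argument behind the specialized vocabulary. The resolution of such a vocabulary bounds the achievable precision.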