FlowDistill: Scalable Traffic Flow Prediction via Distillation from LLMs
Chenyang Yu, Xinpeng Xie, Yan Huang, Chenxi Qiu
TL;DR
This work tackles data-efficient traffic flow prediction by distilling knowledge from large language models (LLMs) into a lightweight MLP. The FlowDistill framework has three modules: Instruction Tuning, Teacher Guidance, and a Variational Information Bottleneck–regularized MLP. Together they encode spatio-temporal context via embeddings and learn compact latent representations for forecasting. On NYC and Chicago taxi datasets, the approach is more accurate and more scalable than graph-based baselines, achieving strong performance with only 10% of the data those baselines need and with significantly lower memory use and latency. The results suggest that LLM-informed knowledge distillation (KD) can yield practical, edge-friendly traffic prediction that adapts across diverse urban environments with limited labeled data.
Abstract
Accurate traffic flow prediction is vital for optimizing urban mobility, yet it remains difficult in many cities due to complex spatio-temporal dependencies and limited high-quality data. While deep graph-based models demonstrate strong predictive power, their performance often comes at the cost of high computational overhead and substantial training data requirements, making them impractical for deployment in resource-constrained or data-scarce environments. We propose FlowDistill, a lightweight and scalable traffic prediction framework based on knowledge distillation from large language models (LLMs). In this teacher-student setup, a fine-tuned LLM guides a compact multi-layer perceptron (MLP) student model using a novel combination of the information bottleneck principle and a teacher-bounded regression loss, so that the distilled model retains only essential and transferable knowledge. Spatial and temporal correlations are explicitly encoded to enhance the model's generalization across diverse urban settings. Despite its simplicity, FlowDistill consistently outperforms state-of-the-art models in prediction accuracy while requiring significantly less training data and achieving lower memory usage and inference latency, highlighting its efficiency and suitability for real-world, scalable deployment.
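To make the student objective described in the abstract more concrete, the following is a minimal sketch of how a teacher-bounded regression term and an information bottleneck penalty might be combined for a single scalar flow prediction. The abstract does not give the exact formulation, so everything here is an assumption: the margin-based form of the teacher-bounded loss, the Gaussian KL term used as the bottleneck penalty, the additive combination, and the names `teacher_bounded_loss`, `vib_kl`, `student_loss`, `alpha`, `beta`, and `margin` are all hypothetical and chosen for illustration only.

```python
import math


def teacher_bounded_loss(student_pred, teacher_pred, target, margin=0.0):
    """Squared error of the student, counted only while the student is not
    yet better than the teacher by at least `margin` (assumed form)."""
    student_err = (student_pred - target) ** 2
    teacher_err = (teacher_pred - target) ** 2
    return student_err if student_err + margin > teacher_err else 0.0


def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    This is the usual closed-form Gaussian KL used as a variational
    information bottleneck penalty; its use here is an assumption.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def student_loss(student_pred, teacher_pred, target, mu, log_var,
                 alpha=0.5, beta=1e-3, margin=0.0):
    """Hypothetical total loss: ground-truth error, plus a weighted
    teacher-bounded term, plus a weighted bottleneck penalty."""
    hard = (student_pred - target) ** 2
    bounded = teacher_bounded_loss(student_pred, teacher_pred, target, margin)
    return hard + alpha * bounded + beta * vib_kl(mu, log_var)


if __name__ == "__main__":
    # Student is worse than the teacher here, so the bounded term is active.
    print(student_loss(14.0, 12.0, 10.5, mu=[0.0], log_var=[0.0]))
```

In this sketch the teacher-bounded term vanishes once the student's error drops below the teacher's (minus the margin), so the teacher acts as a bound and not as a target to imitate exactly, while `beta` controls how strongly the latent representation is compressed.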
