FlowDistill: Scalable Traffic Flow Prediction via Distillation from LLMs

Chenyang Yu, Xinpeng Xie, Yan Huang, Chenxi Qiu

TL;DR

This work tackles data-efficient traffic flow prediction by combining large language models (LLMs) with a lightweight MLP through knowledge distillation (KD). The FlowDistill framework has three modules: Instruction Tuning, Teacher Guidance, and an MLP regularized by a Variational Information Bottleneck (VIB). It encodes spatio-temporal context via embeddings and learns compact latent representations for forecasting. The approach demonstrates superior accuracy and scalability on NYC and Chicago taxi datasets, achieving strong performance with only 10% of the data needed by graph-based baselines and with significantly lower memory usage and inference latency. The results suggest that LLM-informed KD can yield practical, edge-friendly traffic prediction capable of adapting across diverse urban environments with limited labeled data.

Abstract

Accurate traffic flow prediction is vital for optimizing urban mobility, yet it remains difficult in many cities due to complex spatio-temporal dependencies and limited high-quality data. While deep graph-based models demonstrate strong predictive power, their performance often comes at the cost of high computational overhead and substantial training data requirements, making them impractical for deployment in resource-constrained or data-scarce environments. We propose FlowDistill, a lightweight and scalable traffic prediction framework based on knowledge distillation from large language models (LLMs). In this teacher-student setup, a fine-tuned LLM guides a compact multi-layer perceptron (MLP) student model using a novel combination of the information bottleneck principle and a teacher-bounded regression loss, ensuring the distilled model retains only essential and transferable knowledge. Spatial and temporal correlations are explicitly encoded to enhance the model's generalization across diverse urban settings. Despite its simplicity, FlowDistill consistently outperforms state-of-the-art models in prediction accuracy while requiring significantly less training data and achieving lower memory usage and inference latency, highlighting its efficiency and suitability for real-world, scalable deployment.
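
The abstract names a teacher-bounded regression loss without giving its form. As a rough illustration only, the sketch below assumes the formulation commonly used in the knowledge-distillation literature: the student is penalized with its squared error only while that error (plus a margin) still exceeds the teacher's error. The function name `teacher_bounded_loss` and the `margin` parameter are hypothetical, and the paper's exact loss may differ.

```python
def teacher_bounded_loss(student_pred, teacher_pred, target, margin=0.0):
    """Squared error of the student, applied only while the student is
    not yet better than the teacher by at least `margin` (assumed form)."""
    student_err = (student_pred - target) ** 2
    teacher_err = (teacher_pred - target) ** 2
    # Once the student beats the teacher by the margin, the penalty is switched off.
    return student_err if student_err + margin > teacher_err else 0.0


# Student is worse than the teacher: the full squared error is returned.
worse = teacher_bounded_loss(student_pred=10.0, teacher_pred=8.0, target=7.0)
# Student is already better than the teacher: no penalty.
better = teacher_bounded_loss(student_pred=7.5, teacher_pred=9.0, target=7.0)
```

Under this assumed form, the teacher acts as a bound on the student's error rather than as a target to imitate exactly, so a weak teacher prediction does not pull the student away from the ground truth.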

Paper Structure

This paper contains 32 sections, 19 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Comparison of memory cost and training data required to achieve the same performance (i.e., $\text{MAE} = 7.22$ on the NYC dataset). Note that STSGCN, ASTGCN, ST-MLP, and EasyST fail to reach $\text{MAE} = 7.22$ even with the maximum training data proportion. A detailed explanation of the experimental results related to this figure can be found in Table \ref{table:trainingtime}. Memory cost is represented by the bubble size.
  • Figure 2: Overall model framework. ① Instruction Tuning Module. ② Teacher Guidance Module. ③ VIB-MLP Module: the aggregated embeddings, including spatial context, time of day, and day of week, are processed through an MLP to derive the latent variables $\boldsymbol{\mu}_Z$ and $\boldsymbol{\sigma}^2_Z$. Using the reparameterization trick, the latent representation $Z$ is sampled and passed through a fully connected layer to generate the final prediction $\hat{Y}$. ④ Spatial-Temporal Regularized Loss.
  • Figure 3: Performance w.r.t. training data ratio in NYC
  • Figure 4: Performance w.r.t. training data ratio in Chicago
  • Figure 5: Temporal-based prediction comparison in NYC
  • ...and 3 more figures
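
The VIB-MLP step described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a single linear layer stands in for the MLP, a softplus keeps the variance positive, and the KL term uses a standard normal prior. All names (`linear`, `vib_forward`, the `params` keys) and the toy weights are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softplus(a):
    """Smooth positive map, used here to keep variances above zero."""
    return math.log1p(math.exp(a))


def reparameterize(mu, var, rng):
    """Sample Z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) ), the assumed VIB penalty."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def vib_forward(embedding, params, rng):
    """Aggregated embedding -> (mu_Z, sigma^2_Z) -> sampled Z -> prediction."""
    mu = linear(embedding, params["w_mu"], params["b_mu"])
    var = [softplus(a) + 1e-6
           for a in linear(embedding, params["w_var"], params["b_var"])]
    z = reparameterize(mu, var, rng)
    y_hat = linear(z, params["w_out"], params["b_out"])[0]
    return y_hat, kl_to_standard_normal(mu, var)


# Toy parameters: 3-dim embedding, 2-dim latent, scalar output.
params = {
    "w_mu": [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]], "b_mu": [0.0, 0.1],
    "w_var": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], "b_var": [0.0, 0.0],
    "w_out": [[1.0, -1.0]], "b_out": [0.5],
}
y_hat, kl = vib_forward([0.2, -0.1, 0.4], params, random.Random(7))
```

In training, the KL term would be added to the regression loss with a weighting coefficient, which is what pushes the latent $Z$ to keep only the information needed for the prediction. At inference time, a common choice is to use $\boldsymbol{\mu}_Z$ directly instead of sampling.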

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4