Rethinking the Role of LLMs in Time Series Forecasting
Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang, Xiaoyu Shen
TL;DR
Rethinking the Role of LLMs in Time Series Forecasting demonstrates that large-language-model-based time series forecasting (LLM4TSF) yields meaningful forecasting improvements, especially in cross-domain generalization, when trained with diverse, cross-dataset data and properly aligned inputs. By disentangling pretrained knowledge from architectural capacity and comparing pre-alignment versus post-alignment strategies, the study shows that gains arise from both priors and modeling power, with pre-alignment often providing more effective integration. A novel routing analysis at the token level provides mechanistic evidence for when and how LLMs contribute, and prompts consistently improve performance, underscoring the value of semantic guidance. The work offers practical design guidelines and releases code to foster robust, cross-domain TSF systems, while cautioning against blind scaling without appropriate modality alignment.
Abstract
Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that \emph{LLM4TSF indeed improves forecasting performance}, with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90\% of tasks. Both the pretrained knowledge and the model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, our findings overturn prior negative assessments, establish clear conditions under which LLMs are useful, and provide practical guidance for effective model design. We release our code at https://github.com/EIT-NLP/LLM4TSF.
