LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting
Yu-Neng Chuang, Songchen Li, Jiayi Yuan, Guanchu Wang, Kwei-Herng Lai, Songyuan Sui, Leisheng Yu, Sirui Ding, Chia-Yuan Chang, Qiaoyu Tan, Daochen Zha, Xia Hu
TL;DR
The paper addresses the challenge of training universal large time series models (LTSMs) across heterogeneous datasets and proposes LTSM-Bundle, a modular toolbox and benchmark. It systematically evaluates design choices across prompting, tokenization, training paradigms, base models, data quantity, and dataset diversity, using a suite of TSF benchmarks and a TDengine-backed end-to-end pipeline. The study finds that time series prompts, linear tokenization, fully fine-tuned GPT-2-Medium backbones, and moderate data down-sampling with diverse datasets yield strong zero-shot and few-shot performance, often matching or surpassing full-data baselines with only 5% of the data. These findings offer a practical baseline and a scalable platform for future research and real-world TSF tasks, enabling reproducible, component-wise analysis of LTSM training strategies.
Abstract
Time Series Forecasting (TSF) has long been a challenge in time series analysis. Inspired by the success of Large Language Models (LLMs), researchers are now developing Large Time Series Models (LTSMs), universal transformer-based models that use autoregressive prediction, to improve TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, and patterns across datasets. Recent work has explored various design choices aimed at enhancing LTSM training and generalization, but these choices are typically studied and evaluated in isolation rather than benchmarked collectively. In this work, we introduce LTSM-Bundle, a comprehensive toolbox and benchmark for training LTSMs, spanning pre-processing techniques, model configurations, and dataset configurations. It modularizes and benchmarks LTSMs along multiple dimensions, encompassing prompting strategies, tokenization approaches, training paradigms, base model selection, data quantity, and dataset diversity. Furthermore, we combine the most effective design choices identified in our study. Empirical results demonstrate that this combination achieves superior zero-shot and few-shot performance compared to state-of-the-art LTSMs and traditional TSF methods on benchmark datasets.
