Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting
Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu
TL;DR
The paper addresses a core bottleneck in time-series forecasting: the lack of credible, leak-free benchmarks that reflect real-world data dynamics. It proposes Fidel-TS, a benchmark built from live API streams with exogenous textual context (primarily weather and maintenance signals) and a clear Subject–Channel data structure that enables robust generalization testing. Using a universal cross-modal evaluation framework, the authors show that prior benchmarks exhibit biases and that the causal relevance of the textual information determines whether multimodal models achieve genuine gains. In their experiments, FIATS often delivers the best multimodal performance, while LLMs struggle under leak-free conditions. The work offers a principled standard for evaluating forecasting models, with implications for model development, and highlights the limits of relying on traditional benchmarks to measure progress.
Abstract
The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.
