Fidel-TS: A High-Fidelity Benchmark for Multimodal Time Series Forecasting

Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu

TL;DR

The paper addresses a core bottleneck in time-series forecasting: the lack of credible, leak-free benchmarks that reflect real-world data dynamics. It proposes Fidel-TS, a benchmark built from live API streams with exogenous textual context (primarily weather and maintenance signals) and a clear Subject–Channel data structure to enable robust generalization testing. Through a universal cross-modal evaluation framework, the authors show that prior benchmarks exhibit biases and that textual information's causal relevance determines true multimodal gains, with FIATS often delivering the best multimodal performance while LLMs struggle under leak-free conditions. The work provides a principled standard for evaluating forecasting models, highlighting implications for model development and the limitations of relying on traditional benchmarks for progress evaluation.

Abstract

The evaluation of time series forecasting models is hindered by a critical lack of high-quality benchmarks, leading to a potential illusion of progress. Existing datasets suffer from issues ranging from pre-training data contamination in the age of LLMs to the causal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, strict causal soundness, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from the ground up on these principles by sourcing data from live APIs. Our extensive experiments validate this approach by exposing the critical biases and design limitations of prior benchmarks. Furthermore, we conclusively demonstrate that the causal relevance of textual information is the key factor in unlocking genuine performance gains in multimodal forecasting.

Paper Structure

This paper contains 41 sections, 1 equation, 5 figures, 10 tables.

Figures (5)

  • Figure 1: The Construction Pipeline of Fidel-TS. The process integrates raw time series from diverse subjects and channels with dynamic textual information, all sourced from real-time APIs. The Imperfection Handling step interpolates short data gaps and converts long downtime into time-aligned 'Sensor Downtime' events. This enriched data, combined with static metadata, forms a unified dataset that is systematically organized into Realistic Benchmark Settings: downsampled (-h), importance-sampled (-mini), observed (-obs, for in-domain evaluation), and hidden (-hid, for generalization evaluation).
  • Figure 2: Visualization of single-channel time series data from three datasets. We select a representative time period to show the patterns of the time series in each dataset.
  • Figure 3: Visualization of the textual data. We select representative weather data and control events.
  • Figure 4: Prompt template for LLM in unimodal forecasting
  • Figure 5: Prompt template for LLM in multimodal forecasting
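
The Imperfection Handling step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `handle_gaps`, the `max_gap` threshold, the use of linear interpolation, and the index-based event format are all assumptions made for illustration. The caption says only that short gaps are interpolated and long downtime becomes 'Sensor Downtime' events.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, int, str]  # (start index, end index inclusive, label)


def handle_gaps(
    values: List[Optional[float]], max_gap: int = 2
) -> Tuple[List[Optional[float]], List[Event]]:
    """Fill short gaps by linear interpolation; report long gaps as events.

    `max_gap` is a hypothetical threshold (in samples), not a value
    taken from the paper.
    """
    filled = list(values)
    events: List[Event] = []
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        # Find the full run of missing samples starting at `start`.
        start = i
        while i < n and values[i] is None:
            i += 1
        end = i - 1
        length = end - start + 1
        has_neighbors = start > 0 and i < n
        if length <= max_gap and has_neighbors:
            # Short gap: interpolate linearly between the two neighbors.
            left, right = values[start - 1], values[i]
            step = (right - left) / (length + 1)
            for k in range(length):
                filled[start + k] = left + step * (k + 1)
        else:
            # Long gap, or a gap touching the series boundary (an
            # assumption of this sketch): keep it missing and emit an event.
            events.append((start, end, "Sensor Downtime"))
    return filled, events


series = [1.0, None, 3.0, None, None, None, None, 8.0]
filled, events = handle_gaps(series, max_gap=2)
print(filled)
print(events)
```

In this sketch the one-sample gap is filled from its neighbors, while the four-sample gap stays missing and is reported as a single event. A forecasting pipeline could then align that event with the textual channel.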