Context is Key: A Benchmark for Forecasting with Essential Textual Information

Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin

TL;DR

CiK introduces a principled benchmark for forecasting that mandates leveraging natural language context alongside numerical histories. It formalizes context-aided forecasting, proposes a Region of Interest CRPS scoring rule, and evaluates a wide spectrum of models, finding that context-rich LLM prompting often yields strong, data-efficient forecasts albeit with notable robustness and cost tradeoffs. The work demonstrates that contextual information is both crucial and challenging to harness, offering insights and metrics to drive development of accurate, accessible multimodal forecasters. By providing open data, tasks, and evaluation protocols, CiK aims to accelerate research in context-aware, probabilistic time-series forecasting with real-world impact.

Abstract

Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0/.

Paper Structure

This paper contains 74 sections, 13 equations, 35 figures, 12 tables.

Figures (35)

  • Figure 1: An example task from the proposed Context is Key (CiK) benchmark with GPT-4o forecasts in blue and the ground truth in yellow. Left: Forecasts based on the numerical history alone are inaccurate, as nothing indicates a reversion to zero. Right: The context enables better forecasts because it reveals that the series represents photovoltaic power production. Hence, the model can deduce that no power will be produced at night. The context also enables better estimation of the peak hour of production by providing statistics from the history.
  • Figure 2: The tasks in the CiK benchmark rely on real-world numerical data from 7 domains.
  • Figure 3: Number of tasks per context type in CiK.
  • Figure 4: Illustration of a CiK task annotated with types of natural language context: ① The short numerical history is misleading, suggesting an increasing trend. However, contextual information compensates and enables accurate forecasts: ② The intemporal information ($\mathbf{c}_I$) reveals the nature of the series, implying a seasonal pattern with greater prevalence in the summer months due to weather. ③ The future information ($\mathbf{c}_{F}$) reveals that the series cannot continue its increasing trend. ④ The historical information ($\mathbf{c}_H$) complements the short history by providing high-level statistics on past values. ⑤ The covariate information ($\mathbf{c}_{\text{cov}}$) reveals an association with another quantity: field fires. Could its behavior impact future values of the target series? ⑥ No, the causal information ($\mathbf{c}_{\text{causal}}$) provides the answer.
  • Figure 5: Proportion of tasks for which LLM-based methods outperform the 7 quantitative forecasting methods (see \ref{subsec:baselines}). A method is considered to outperform another on a task if its average RCRPS is lower on said task. Results are shown for variants that use (left) and do not use (right) the natural language context. A full green bar would indicate that the method is better on all tasks, whereas a full red bar would indicate that it is worse everywhere. Tasks are weighted according to \ref{subsec:protocol}.
  • ...and 30 more figures
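
The comparison rule stated in the Figure 5 caption (a method outperforms another on a task when its average RCRPS is lower, with tasks weighted) can be illustrated with a minimal sketch. The function name, the task identifiers, the scores, and the weights below are all hypothetical and chosen only for illustration; the benchmark's actual weighting scheme is defined in its evaluation protocol and is not reproduced here.

```python
from typing import Dict


def weighted_win_proportion(
    method_scores: Dict[str, float],
    baseline_scores: Dict[str, float],
    task_weights: Dict[str, float],
) -> float:
    """Weighted share of tasks on which the method beats the baseline.

    Scores are per-task average values of a lower-is-better metric
    (RCRPS in the caption). Ties count as not outperforming.
    All inputs are keyed by a hypothetical task identifier.
    """
    tasks = list(method_scores)
    total_weight = sum(task_weights[t] for t in tasks)
    if total_weight <= 0:
        raise ValueError("task weights must sum to a positive value")
    win_weight = sum(
        task_weights[t]
        for t in tasks
        if method_scores[t] < baseline_scores[t]
    )
    return win_weight / total_weight


# Made-up numbers for three hypothetical tasks.
llm_scores = {"task_a": 0.10, "task_b": 0.40, "task_c": 0.20}
baseline_scores = {"task_a": 0.30, "task_b": 0.25, "task_c": 0.50}
weights = {"task_a": 0.5, "task_b": 0.25, "task_c": 0.25}

proportion = weighted_win_proportion(llm_scores, baseline_scores, weights)
print(proportion)
```

In this reading, a full green bar in the figure corresponds to a proportion of 1.0 and a full red bar to 0.0. Comparing against several quantitative methods would repeat the same computation once per baseline.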