Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs

Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin

TL;DR

A unified framework of four strategies is introduced that addresses limitations along three orthogonal dimensions (model diagnostics, accuracy, and efficiency), providing practitioners with a comprehensive toolkit for practical LLM-based context-aided forecasting.

Abstract

Real-world forecasting requires models to integrate not only historical data but also relevant contextual information provided in textual form. While large language models (LLMs) show promise for context-aided forecasting, critical challenges remain: we lack diagnostic tools to understand failure modes, performance remains far below their potential, and high computational costs limit practical deployment. We introduce a unified framework of four strategies that address these limitations along three orthogonal dimensions: model diagnostics, accuracy, and efficiency. Through extensive evaluation across model families from small open-source models to frontier models including Gemini, GPT, and Claude, we uncover both fundamental insights and practical solutions. Our findings span three key dimensions: diagnostic strategies reveal the "Execution Gap" where models correctly explain how context affects forecasts but fail to apply this reasoning; accuracy-focused strategies achieve substantial performance improvements of 25-50%; and efficiency-oriented approaches show that adaptive routing between small and large models can approach large model accuracy on average while significantly reducing inference costs. These orthogonal strategies can be flexibly integrated based on deployment constraints, providing practitioners with a comprehensive toolkit for practical LLM-based context-aided forecasting.

Paper Structure

This paper contains 90 sections, 1 equation, 56 figures, 26 tables.

Figures (56)

  • Figure 1: Scope of our study. We propose four complementary strategies that extend naïve Direct Prompting (DP) (Williams et al., 2024) along different dimensions. FxDP (top) enables model diagnostics by eliciting explanations about how context affects forecasts, RouteDP (bottom-left) reduces inference costs through adaptive model routing, and IC-DP and CorDP (bottom-right) substantially improve forecasting accuracy, especially for smaller models.
  • Figure 2: Examples of context-aided forecasting tasks from the Context-is-Key (CiK) benchmark (Williams et al., 2024). CiK comprises 2,644 time series across 7 real-world domains/datasets; it is designed to benchmark context-aided forecasting models, with tasks where the textual context is necessary for accurate forecasts.
  • Figure 3: Forecast effect explanation accuracy and forecast improvement across models. Each bar shows the percentage of tasks falling into three categories: accurate explanation with improved forecast (green segment), accurate explanation but no forecast improvement (gray segment, the "Execution Gap"), and inaccurate explanation (red segment). Larger models can both reason about forecast effects correctly and apply them to improve forecasts, while smaller models often explain accurately but fail to translate this into improved forecasts. Results use a panel of LLM judges and are robust to the choice of judge; extended results are in \ref{cref:analysis-judge-human}.
  • Figure 4: Aggregate RCRPS results comparing Direct Prompting (DP) with In-Context Direct Prompting (IC-DP). IC-DP improves performance for 14/16 models, with gains of 14-56% for small models and 20-40% for mid-size and large models, demonstrating that a single in-context example significantly enhances performance across model scales.
  • Figure 5: Average RCRPS with Qwen2.5-0.5B-Inst as the small model, as an increasing percentage of tasks is routed to the large model (Llama-405B-Inst). We use Qwen2.5-0.5B-Inst as the router and compare it to random and ideal routing. The router meaningfully captures task difficulty and routes tasks so as to improve aggregate performance: RouteDP captures 66% of the possible area between random and ideal routing. Results with other models are in \ref{app:router-plots}.
  • ...and 51 more figures
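
The routing comparison described for Figure 5 can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy scores, and the exact definition of "area captured" (the router's area gain over random routing, divided by the ideal router's gain) are assumptions made for illustration. It also assumes that lower RCRPS is better and that the router emits one score per task, with higher scores meaning "send this task to the large model first".

```python
from typing import List, Sequence


def routing_curve(small: Sequence[float], large: Sequence[float],
                  priority: Sequence[float]) -> List[float]:
    """Mean error after routing the k highest-priority tasks to the large model, k = 0..n."""
    n = len(small)
    # Tasks with the highest priority are routed to the large model first.
    order = sorted(range(n), key=lambda i: priority[i], reverse=True)
    total = sum(small)
    curve = [total / n]
    for i in order:
        total += large[i] - small[i]  # swap in the large model's error for task i
        curve.append(total / n)
    return curve


def area_under(curve: Sequence[float]) -> float:
    """Trapezoidal area over the routed fraction in [0, 1], with uniform spacing."""
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n


def captured_fraction(small: Sequence[float], large: Sequence[float],
                      router_scores: Sequence[float]) -> float:
    """Assumed metric: share of the random-to-ideal area recovered by the router."""
    n = len(small)
    router = routing_curve(small, large, router_scores)
    # Ideal routing (assumed): route tasks with the largest error reduction first.
    gains = [s - l for s, l in zip(small, large)]
    ideal = routing_curve(small, large, gains)
    # Random routing in expectation: a straight line between the two model means.
    s_mean, l_mean = sum(small) / n, sum(large) / n
    random_curve = [s_mean + (l_mean - s_mean) * k / n for k in range(n + 1)]
    gap = area_under(random_curve) - area_under(ideal)
    if gap == 0:
        return 0.0  # the two models are indistinguishable; nothing to capture
    return (area_under(random_curve) - area_under(router)) / gap


# Toy per-task errors (hypothetical numbers, not taken from the paper).
small_err = [1.0, 0.5, 0.8, 0.2]
large_err = [0.2, 0.4, 0.3, 0.2]
router_scores = [0.9, 0.3, 0.7, 0.1]  # hypothetical router outputs
print(captured_fraction(small_err, large_err, router_scores))
```

Under this definition, a router that ranks tasks exactly as the ideal ordering does scores 1, a router no better than chance scores about 0, and a router that consistently sends the wrong tasks to the large model scores below 0.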