Analyzing the Role of Context in Forecasting with Large Language Models

Gerrit Mutschlechner, Adam Jatowt

TL;DR

The paper tackles how contextual information influences automated forecasting with large language models on binary events. It introduces a new dataset of 614 recent Metaculus questions, each augmented with at least three related news articles and concise summaries, to test context-aware prompting across three LLMs (GPT-3.5-turbo, Alpaca-7B, Llama2-13B-chat). Evaluation across five prompts that progressively add background, news, resolution criteria, and few-shot examples reveals that news context significantly boosts accuracy, while few-shot augmentation can reduce performance; larger models generally perform better. The findings provide practical guidance for prompt design in automated forecasting and motivate further expansion to more questions, models, and timelines. The work highlights the importance of timely, multi-source context in improving the reliability of AI-assisted forecasts and sets the stage for richer datasets and analyses in this domain.

Abstract

This study evaluates the forecasting performance of recent large language models (LLMs) on binary forecasting questions. We first introduce a novel dataset of over 600 binary forecasting questions, augmented with related news articles and their concise question-related summaries. We then explore the impact of input prompts with varying levels of context on forecasting performance. The results indicate that incorporating news articles significantly improves performance, while using few-shot examples leads to a decline in accuracy. We find that larger models consistently outperform smaller models, highlighting the potential of LLMs in enhancing automated forecasting.
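The progressive prompt design can be illustrated with a minimal sketch. It assumes the five prompt variants are cumulative, as the Figure 1 legend suggests (1: Q; 2: Q,B; 3: Q,B,NA; 4: Q,B,NA,R; 5: Q,B,NA,R,FS), and reads the abbreviations as question, background, news articles, resolution criteria, and few-shot examples. The class `ForecastQuestion`, the function `build_prompt`, and all prompt wording are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

# Components included at each prompt level (assumed cumulative, per the Figure 1 legend).
PROMPT_LEVELS = {
    1: ("Q",),
    2: ("Q", "B"),
    3: ("Q", "B", "NA"),
    4: ("Q", "B", "NA", "R"),
    5: ("Q", "B", "NA", "R", "FS"),
}


@dataclass
class ForecastQuestion:
    """Hypothetical container for one binary forecasting question."""
    question: str
    background: str = ""
    news_summaries: List[str] = field(default_factory=list)
    resolution_criteria: str = ""


def build_prompt(
    item: ForecastQuestion,
    level: int,
    few_shot: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt containing only the components of the given level.

    The wording of each section is illustrative, not the paper's.
    """
    if level not in PROMPT_LEVELS:
        raise ValueError(f"level must be one of {sorted(PROMPT_LEVELS)}")
    components = PROMPT_LEVELS[level]
    parts: List[str] = []

    # Few-shot examples (question, answer) go first when the level includes them.
    if "FS" in components:
        for example_question, example_answer in few_shot:
            parts.append(
                f"Example question: {example_question}\n"
                f"Example answer: {example_answer}"
            )

    parts.append(f"Question: {item.question}")
    if "B" in components:
        parts.append(f"Background: {item.background}")
    if "NA" in components:
        for index, summary in enumerate(item.news_summaries, start=1):
            parts.append(f"News summary {index}: {summary}")
    if "R" in components:
        parts.append(f"Resolution criteria: {item.resolution_criteria}")

    parts.append("Answer with 'yes' or 'no'.")
    return "\n\n".join(parts)
```

Calling `build_prompt(item, 3)` on such an item would yield a prompt with the question, background, and news summaries but without resolution criteria or few-shot examples, which is the kind of controlled comparison the study describes.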

Paper Structure (25 sections, 25 figures, 3 tables)

Figures (25)

  • Figure 1: Ratio of questions forecasted as 'no' across forecasts made by various LLMs using different prompts. The numbers on the x-axis correspond to the following prompts: 1: Q; 2: Q,B; 3: Q,B,NA; 4: Q,B,NA,R; 5: Q,B,NA,R,FS.
  • Figure 2: Distribution of categories.
  • Figure 3: Confusion matrices for forecasts with only the question as input.
  • Figure 4: Confusion matrices for forecasts with question and background information as input.
  • Figure 5: Confusion matrices for forecasts with question, background information, and news articles as input.
  • ...and 20 more figures