Analyzing the Role of Context in Forecasting with Large Language Models
Gerrit Mutschlechner, Adam Jatowt
TL;DR
The paper tackles how contextual information influences automated forecasting with large language models on binary events. It introduces a new dataset of 614 recent Metaculus questions, each augmented with at least three related news articles and concise summaries, to test context-aware prompting across three LLMs (GPT-3.5-turbo, Alpaca-7B, Llama2-13B-chat). Evaluation across five prompts that progressively add background, news, resolution criteria, and few-shot examples reveals that news context significantly boosts accuracy, while few-shot augmentation can reduce performance; larger models generally perform better. The findings provide practical guidance for prompt design in automated forecasting and motivate further expansion to more questions, models, and timelines. The work highlights the importance of timely, multi-source context in improving the reliability of AI-assisted forecasts and sets the stage for richer datasets and analyses in this domain.
Abstract
This study evaluates the forecasting performance of recent large language models (LLMs) on binary forecasting questions. We first introduce a novel dataset of over 600 binary forecasting questions, augmented with related news articles and their concise question-related summaries. We then explore the impact of input prompts with varying levels of context on forecasting performance. The results indicate that incorporating news articles significantly improves performance, while using few-shot examples leads to a decline in accuracy. We find that larger models consistently outperform smaller models, highlighting the potential of LLMs in enhancing automated forecasting.
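As an illustration of prompts with varying levels of context, the following is a minimal Python sketch of how five progressively richer prompts might be assembled. It is not the authors' implementation. The `ForecastQuestion` class, the `build_prompt` function, the prompt wording, and the exact mapping of levels to context (1: question only, 2: plus background, 3: plus news summaries, 4: plus resolution criteria, 5: plus few-shot examples) are assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ForecastQuestion:
    """Hypothetical container for one binary forecasting question."""
    question: str
    background: str = ""
    news_summaries: List[str] = field(default_factory=list)
    resolution_criteria: str = ""


def build_prompt(
    q: ForecastQuestion,
    level: int,
    few_shot_examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt whose context grows with `level` (1 to 5).

    The level-to-context mapping is an assumption, not taken from the paper.
    """
    if not 1 <= level <= 5:
        raise ValueError("level must be between 1 and 5")

    parts: List[str] = []

    # Level 5: prepend solved examples (few-shot).
    if level >= 5:
        for example_question, example_answer in few_shot_examples:
            parts.append(
                f"Example question: {example_question}\n"
                f"Example answer: {example_answer}"
            )

    # Level 2 and above: background text for the question.
    if level >= 2 and q.background:
        parts.append(f"Background: {q.background}")

    # Level 3 and above: summaries of related news articles.
    if level >= 3 and q.news_summaries:
        bullets = "\n".join(f"- {s}" for s in q.news_summaries)
        parts.append(f"News:\n{bullets}")

    # Level 4 and above: how the question will be resolved.
    if level >= 4 and q.resolution_criteria:
        parts.append(f"Resolution criteria: {q.resolution_criteria}")

    # Every level ends with the question and a binary answer instruction.
    parts.append(f"Question: {q.question}\nAnswer with Yes or No.")
    return "\n\n".join(parts)
```

Calling `build_prompt(q, level)` for `level` in 1 through 5 on the same question yields prompts that differ only in how much context they carry, which is the kind of comparison the study describes.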
