MIRAI: Evaluating LLM Agents for Event Forecasting

Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang

TL;DR

MIRAI tackles the problem of rigorously evaluating Large Language Model (LLM) agents in temporal forecasting of international events. It introduces an agentic benchmark that combines a GDELT-derived relational-text dataset with an API-driven environment, enabling autonomous information gathering, API-based code execution, and multi-format reasoning through a ReAct-style loop. The work provides a meticulous data pipeline (including preprocessing, credibility filtering, and test set construction) and a suite of evaluation metrics (precision, recall, F1, KL divergence) across forecasting horizons. Key findings show that tool-rich, multi-hop reasoning (especially with strong LLMs like GPT-4o) yields the best performance, while longer-horizon forecasts remain challenging and reveal the value of inference-time strategies such as self-consistency for smaller models. Overall, MIRAI offers a practical framework to drive development of more accurate and reliable forecasting capabilities for international relations analysis.
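
The metrics named above can be illustrated with a minimal sketch. It assumes that a forecast is a set of relation codes compared against a gold set, and that KL divergence is computed between class-frequency distributions with a small additive smoothing term. The function names, the smoothing constant, and the example codes are illustrative assumptions, not the benchmark's actual implementation.

```python
import math
from collections import Counter


def set_prf(predicted, gold):
    """Set-based precision, recall, and F1 between predicted and gold relation codes."""
    pred, true = set(predicted), set(gold)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def kl_divergence(gold_labels, pred_labels, classes, eps=1e-9):
    """KL(gold || predicted) over a fixed class list.

    The additive smoothing (eps) is an assumption made here so that
    classes absent from the predictions do not produce a division by zero.
    """
    def distribution(labels):
        counts = Counter(labels)
        total = len(labels) + eps * len(classes)
        return [(counts[c] + eps) / total for c in classes]

    p, q = distribution(gold_labels), distribution(pred_labels)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical example: three predicted codes, two of which are correct.
precision, recall, f1 = set_prf(["01", "04", "19"], ["04", "19"])
```

In this toy case two of the three predicted codes are correct and both gold codes are recovered, so recall is perfect while precision is penalized for the extra guess.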

Abstract

Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there has been growing interest in employing LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to use different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities along three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code using domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge from diverse formats and times to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

Paper Structure (63 sections, 8 figures, 8 tables)

Figures (8)

  • Figure 1: An example of forecasting the relations between Australia and China on Nov. 18, 2023. The database contains query-related historical relations and news articles, but the agent fails to predict the change of relation and makes a wrong forecast.
  • Figure 2: Mirai comprehensively covers global event data. (a) The circular chart shows the relation hierarchy and distribution in Mirai. (b) The heatmap visualizes the intensity of these events globally, distinguishing between areas of conflict (red) and mediation (blue). (c) The heatmap illustrates the frequency of these events, highlighting regions with the most occurrences.
  • Figure 3: Overview of the LLM agent's interaction with the multi-source environment using the ReAct strategy for forecasting a query event. The framework consists of three main steps: (1) Think: The agent analyzes the current status and plans the next action based on the query and the provided API specifications. (2) Act: The agent generates a "Single Function" call or a "Code Block" to retrieve and analyze relevant data from the database. (3) Execute: The Python interpreter runs the generated code with the API implementation and database and produces observations. These steps are iteratively performed until the agent reaches a final forecast for the future relation.
  • Figure 4: a) Self-consistency of Mistral-7B model increases with more samples. b) F1 scores of different base LLM agents on relation prediction, categorized based on the quadratic classes.
  • Figure 5: a) F1 Accuracy for each API function. b) Code execution error analysis for different LLMs.
  • ...and 3 more figures
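
The Think, Act, Execute loop described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the benchmark's implementation: the `policy` callable stands in for the LLM, `get_relation_counts` is a hypothetical API function, and the event tuples are made-up data. Observations are taken to be whatever the agent-written code prints.

```python
import ast
import contextlib
import io


def execute(code, namespace):
    """Execute step: run agent-written code and return its printed output or error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:  # returned as an observation so the agent can react
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def react_loop(query, policy, api, max_steps=5):
    """Alternate Think/Act and Execute until the policy emits a final forecast."""
    history = []
    namespace = dict(api)  # API functions visible to the generated code
    for _ in range(max_steps):
        step = policy(query, history)  # Think + Act (an LLM call in practice)
        if "final" in step:
            return step["final"], history
        observation = execute(step["action"], namespace)
        history.append((step["thought"], step["action"], observation))
    return None, history  # no forecast within the step budget


# Made-up events: (head country, tail country, relation code).
EVENTS = [("AUS", "CHN", "04"), ("AUS", "CHN", "04"), ("AUS", "CHN", "19")]


def get_relation_counts(head, tail):
    """Hypothetical API function: count past relations between two countries."""
    counts = {}
    for h, t, relation in EVENTS:
        if (h, t) == (head, tail):
            counts[relation] = counts.get(relation, 0) + 1
    return counts


def scripted_policy(query, history):
    """Stand-in for the LLM: query the history once, then forecast the most frequent relation."""
    head, tail = query
    if not history:
        return {
            "thought": "Check which relations occurred most often in the past.",
            "action": f"print(get_relation_counts({head!r}, {tail!r}))",
        }
    counts = ast.literal_eval(history[-1][2])  # parse the printed dict
    return {"final": [max(counts, key=counts.get)]}


forecast, trace = react_loop(
    ("AUS", "CHN"), scripted_policy, {"get_relation_counts": get_relation_counts}
)
```

Returning execution errors as observations, rather than raising them, lets the agent see a failed call and repair its code on the next step.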