Retrieval and Argumentation Enhanced Multi-Agent LLMs for Judgmental Forecasting
Deniz Gorur, Antonio Rago, Francesca Toni
TL;DR
This work frames judgmental forecasting as claim verification and proposes a multi-agent framework that aggregates Quantitative Bipolar Argumentation Frameworks (QBAFs) produced by diverse LLM-based agents. The framework combines existing ArgLLM agents with two further agent types, Retrieval-Augmented ArgLLM (RAG-ArgLLM) agents and Relation-based Argument Mining (RbAM) agents, so that internal LLM reasoning is complemented by arguments grounded in external sources. A central Multi-Agent QBAF Combinator clusters and merges arguments across agents using semantic similarity and base-score aggregation, yielding a single tree-structured QBAF. Experiments on GJOpen and Metaculus show that multi-agent configurations, especially three-agent blends with diverse sources, can improve forecasting accuracy while providing an explainable combination of evidence for claim verification.
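To make the clustering-and-merging idea concrete, the following is a minimal sketch and not the authors' Combinator. It assumes a flat list of arguments per agent, uses token-overlap (Jaccard) similarity as a crude stand-in for semantic similarity, clusters greedily against the first member of each cluster, and aggregates base scores by taking their mean. The names `Arg`, `jaccard`, `merge_arguments`, the 0.5 threshold, and the example arguments are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Arg:
    """One argument about the claim (illustrative structure, not from the paper)."""
    text: str
    base_score: float  # intrinsic strength in [0, 1]
    stance: str        # "support" or "attack", relative to the claim


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def merge_arguments(agent_args: List[List[Arg]], threshold: float = 0.5) -> List[Arg]:
    """Greedily cluster same-stance arguments across agents, then merge each cluster."""
    clusters: List[List[Arg]] = []
    for args in agent_args:
        for arg in args:
            for cluster in clusters:
                rep = cluster[0]  # the first member represents the cluster
                if rep.stance == arg.stance and jaccard(rep.text, arg.text) >= threshold:
                    cluster.append(arg)
                    break
            else:
                # No sufficiently similar cluster found: start a new one.
                clusters.append([arg])
    merged: List[Arg] = []
    for cluster in clusters:
        # Assumed aggregation rule: the mean of the clustered base scores.
        mean_score = sum(a.base_score for a in cluster) / len(cluster)
        merged.append(Arg(cluster[0].text, mean_score, cluster[0].stance))
    return merged


# Hypothetical arguments from two agents about the same claim.
agent_a = [Arg("polls show the incumbent leading", 0.8, "support")]
agent_b = [
    Arg("polls show the incumbent leading narrowly", 0.6, "support"),
    Arg("turnout is expected to be low", 0.5, "attack"),
]
merged = merge_arguments([agent_a, agent_b])
```

In this toy run the two near-duplicate supporting arguments fall into one cluster and the attacking argument stays separate. A real system would likely swap `jaccard` for sentence-embedding similarity and would also need to handle arguments nested deeper in the tree.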
Abstract
Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
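As an illustration of how a tree-shaped QBAF can turn evidence for and against a claim into a single strength, here is a minimal sketch. It assumes a DF-QuAD-style gradual semantics, which is one common choice; the abstract does not state which semantics the agents use. The `Argument` class, the example claim, its base scores, and the 0.5 decision threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    """A node in a tree-shaped QBAF (all names here are illustrative)."""
    text: str
    base_score: float  # intrinsic strength in [0, 1]
    attackers: List["Argument"] = field(default_factory=list)
    supporters: List["Argument"] = field(default_factory=list)


def aggregate(strengths: List[float]) -> float:
    """Probabilistic sum 1 - prod(1 - s); evaluates to 0.0 for an empty list."""
    remaining = 1.0
    for s in strengths:
        remaining *= 1.0 - s
    return 1.0 - remaining


def strength(arg: Argument) -> float:
    """DF-QuAD-style final strength, computed recursively from the leaves up."""
    va = aggregate([strength(a) for a in arg.attackers])
    vs = aggregate([strength(s) for s in arg.supporters])
    tau = arg.base_score
    if va >= vs:
        # Attack dominates (or ties): move the base score towards 0.
        return tau - tau * (va - vs)
    # Support dominates: move the base score towards 1.
    return tau + (1.0 - tau) * (vs - va)


# Hypothetical claim with one supporter and one attacker.
claim = Argument(
    "Event X will happen before the deadline",
    base_score=0.5,
    supporters=[Argument("Recent reports point towards X", 0.8)],
    attackers=[Argument("Similar events have historically been rare", 0.4)],
)
final_strength = strength(claim)
# Assumed decision rule: predict the claim as true when strength exceeds 0.5.
prediction = final_strength > 0.5
```

Here the supporter outweighs the attacker, so the claim's strength rises above its base score of 0.5 (to roughly 0.7) and the sketch predicts that the event will happen. Because the strength is computed from explicit supporting and attacking arguments, the prediction can be traced back to the evidence behind it.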
