Retrieval and Argumentation Enhanced Multi-Agent LLMs for Judgmental Forecasting

Deniz Gorur, Antonio Rago, Francesca Toni

TL;DR

This work frames judgmental forecasting as claim verification and proposes a novel multi-agent framework that aggregates Quantitative Bipolar Argumentation Frameworks (QBAFs) produced by diverse LLM-based agents. By introducing ArgLLM, Retrieval-Augmented ArgLLM (RAG-ArgLLM), and RbAM agents, the approach blends internal LLM reasoning with external grounding to produce robust, explainable QBAFs. A central Multi-Agent QBAF Combinator clusters and merges arguments across agents using semantic similarity and base-score aggregation, delivering a unified, tree-structured QBAF. Experiments on GJOpen and Metaculus show that multi-agent configurations, especially three-agent blends with diverse sources, improve forecasting accuracy and provide transparent evidence integration for claim verification.

Abstract

Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.

Paper Structure

This paper contains 38 sections, 11 theorems, 3 figures, 23 tables, 1 algorithm.

Key Result

Lemma 4.1

Let $x^*,y^* \in \mathcal{X}^*$. Then, $(x^*,y^*) \in \mathcal{A}^*$ (or $(x^*, y^*) \in \mathcal{S}^*$) iff $\exists x \in x^*$ and $\exists y \in y^*$ such that $\exists {i \in \{1,\dots,n\}}$ where $(x, y) \in \mathcal{A}_i$ (or $(x, y) \in \mathcal{S}_i$, respectively).
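The lemma can be read operationally: an attack (or support) holds between two merged arguments exactly when some pair of their members is related in at least one agent's framework. The following is a minimal sketch of that reading, not the paper's implementation. It assumes the merged arguments are given as disjoint clusters covering every argument, and the name `lift_relation` and the toy argument identifiers are hypothetical.

```python
def lift_relation(clusters, agent_relations):
    """Lift per-agent relations to cluster level, following Lemma 4.1.

    clusters: list of frozensets of argument ids (assumed to partition all arguments).
    agent_relations: one set of (x, y) pairs per agent, e.g. the attack
        relations A_1, ..., A_n (or the support relations S_1, ..., S_n).
    Returns the set of (cluster_index, cluster_index) pairs in the lifted relation.
    """
    # Map every original argument to the index of the cluster containing it.
    member_to_cluster = {
        arg: idx for idx, cluster in enumerate(clusters) for arg in cluster
    }
    lifted = set()
    for relation in agent_relations:
        for x, y in relation:
            # (x*, y*) is related iff some x in x*, y in y* are related for some agent i.
            lifted.add((member_to_cluster[x], member_to_cluster[y]))
    return lifted


# Toy usage: two agents, whose arguments a1 and b1 were merged into one cluster.
clusters = [frozenset({"a1", "b1"}), frozenset({"a2"}), frozenset({"b2"})]
attacks = [{("a2", "a1")}, {("b2", "b1")}]
combined_attacks = lift_relation(clusters, attacks)
```

In this toy case both agents' attackers end up attacking the same merged argument, which is the behaviour the "iff" in the lemma describes.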

Figures (3)

  • Figure 1: The overall pipeline. The 'QBAF Generator Agents' module can be instantiated with ArgLLM agents (baseline) or our two RAG-based methods: Retrieval-Augmented ArgLLM agents and RbAM agents, both of which take external sources as input. The 'Multi-Agent QBAF Combinator' module then takes the generated QBAFs (two in the figure, but our method applies to any number) and (1) calculates the similarity between arguments in the QBAFs, (2) combines similar arguments to obtain a single BAF $\mathcal{B}^*$, and (3) aggregates the base scores of the combined arguments to obtain base scores $\tau^*$, leading to a combined QBAF $\mathcal{Q}^*$.
  • Figure 2: The Multi-Agent QBAF Combinator takes two initial QBAFs, $\mathcal{Q}_1$ (top-left) and $\mathcal{Q}_2$ (bottom-left), as input and outputs a single, merged QBAF $\mathcal{Q}^*$ (right).
  • Figure 3: An example of how a RAG-ArgLLM agent (bottom) can improve the results compared to an ArgLLM agent (top). The example is taken from the Metaculus dataset and its claim is false. The ArgLLM agent predicts the claim incorrectly, whereas the RAG-ArgLLM agent predicts it correctly.
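The combinator steps described in the Figure 1 caption (compute similarity, merge similar arguments, aggregate base scores) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' method: token-overlap (Jaccard) similarity stands in for semantic similarity, the merge threshold of 0.5 is arbitrary, base scores are aggregated by a plain mean, and the names `jaccard` and `combine` are hypothetical. Edges between arguments are omitted here.

```python
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: token overlap of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def combine(qbafs, threshold=0.5, similarity=jaccard):
    """Merge arguments from several agents and aggregate their base scores.

    qbafs: list of dicts mapping argument text -> base score in [0, 1].
    Returns a list of merged arguments with their members and mean base score.
    """
    items = [(text, score) for q in qbafs for text, score in q.items()]
    parent = list(range(len(items)))  # union-find forest over all arguments

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Steps (1) and (2): link every pair whose similarity reaches the threshold.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i][0], items[j][0]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)

    # Step (3): aggregate base scores within each merged argument (assumed: mean).
    return [
        {"members": [t for t, _ in g], "base_score": mean(s for _, s in g)}
        for g in groups.values()
    ]


q1 = {"inflation will rise sharply": 0.8, "central bank cuts rates": 0.4}
q2 = {"inflation will rise": 0.6}
merged = combine([q1, q2])
```

Here the two inflation arguments share three of four tokens, so they merge into one argument, while the remaining argument stays on its own. In practice the `similarity` parameter would be swapped for an embedding-based measure.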

Theorems & Definitions (22)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 4.1
  • Example 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Example 2
  • Proposition 4
  • ...and 12 more