Table of Contents
Fetching ...

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

Alireza Amiri-Margavi, Iman Jebellat, Ehsan Jebellat, Seyed Pouyan Mousavi Davoudi

TL;DR

Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with a higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models.

Abstract

We propose a collaborative framework in which multiple large language models -- including GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash -- generate and answer complex, PhD-level statistical questions when definitive ground truth is unavailable. Our study examines how inter-model consensus improves both response reliability and identifies the quality of the generated questions. Employing chi-square tests, Fleiss' Kappa, and confidence interval analysis, we quantify consensus rates and inter-rater agreement to assess both response precision and question quality. Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with a higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models. In contrast, Gemini and LLaMA exhibit greater variability and lower reliability in question formulation. These findings demonstrate that collaborative interactions among large language models enhance response reliability and provide valuable insights for optimizing AI-driven collaborative reasoning systems.

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

TL;DR

Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with a higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models.

Abstract

We propose a collaborative framework in which multiple large language models -- including GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash -- generate and answer complex, PhD-level statistical questions when definitive ground truth is unavailable. Our study examines how inter-model consensus improves both response reliability and identifies the quality of the generated questions. Employing chi-square tests, Fleiss' Kappa, and confidence interval analysis, we quantify consensus rates and inter-rater agreement to assess both response precision and question quality. Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with a higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models. In contrast, Gemini and LLaMA exhibit greater variability and lower reliability in question formulation. These findings demonstrate that collaborative interactions among large language models enhance response reliability and provide valuable insights for optimizing AI-driven collaborative reasoning systems.

Paper Structure

This paper contains 7 sections, 8 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Consensus rate overview showing the levels of agreement among the models when different LLMs generate the questions. The x-axis represents the question-generating LLM model, and the y-axis indicates the percentage of responses in each agreement category.
  • Figure 2: Majority vote and reliability percentages across different models. The majority vote percentage indicates instances where at least two models agree, while the reliability percentage reflects instances where two or more models agree with the question-generating LLM's answer.