MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers

Nicole Cho, William Watson

TL;DR

MultiQ&A addresses the problem of LLM hallucinations and opaque reasoning by introducing a crowdsourced robustness evaluator that expands a single question $q_0$ into $v+1$ variants (the original plus $v$ perturbations) and collects $v+1$ independent answers. The framework, built from a Query Rewriter, an Answer Generator, and an Aggregator, analyzes robustness through semantic clustering, re-ranking, and a mix of supervised and unsupervised metrics, and is demonstrated on 12 QA datasets with $1.9$ million perturbations and $2.3$ million answers. Across extractive, multiple-choice (MC), and abstractive tasks, GPT-3.5-Turbo remains relatively robust under perturbations, with notable exceptions such as MathQA. The approach uses uncertainty, agreement, and reliability metrics to reveal disagreements and potential hallucinations. It offers a scalable, interpretable tool for institutional LLM adoption, enabling measurement of confidence, consistency, and the propensity for hallucinations under adversarial input variations.

Abstract

One critical challenge in the institutional adoption journey of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments culminated in the examination of 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Therefore, our system offers a potential framework for institutional LLM adoption with the ability to measure confidence and consistency and to quantify hallucinations.

Paper Structure

This paper contains 42 sections, 9 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: System Overview for MultiQ&A: A single question $q_0$, supplied by the user, is perturbed in $v$ different ways (while retaining the original question via the identity function). Each perturbed question $q_i\in\mathcal{T}$ is independently answered by the Answer Generator agent. Finally, several metrics are computed for the cohort of answers based on the perturbations. In a practical setting, these variations can be fed into an Aggregator, which organizes and re-ranks the answers according to the user's preferences and the original question. Aggregated statistics are compiled from $1,000$ random permutations of the result set across raters, with labels remapped, thus simulating large-scale item analysis.
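
The flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`perturb`, `run_multiqa_sketch`), the toy rewriters, and the toy answerer are hypothetical stand-ins for the LLM-backed Query Rewriter and Answer Generator, and a simple majority vote over normalized strings stands in for the paper's semantic clustering, re-ranking, and metrics.

```python
from collections import Counter
from typing import Callable, Dict, List

Rewriter = Callable[[str], str]
Answerer = Callable[[str], str]


def perturb(q0: str, rewriters: List[Rewriter]) -> List[str]:
    """Return v+1 questions: the original (identity) plus v perturbations."""
    return [q0] + [rewrite(q0) for rewrite in rewriters]


def run_multiqa_sketch(q0: str, rewriters: List[Rewriter], answerer: Answerer) -> Dict:
    """Perturb q0, answer each variant independently, then aggregate."""
    questions = perturb(q0, rewriters)
    # Each variant is answered on its own, with no shared context.
    answers = [answerer(q) for q in questions]
    # Assumption: exact match after normalization replaces semantic clustering.
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "questions": questions,
        "answers": answers,
        "answer": top_answer,                     # majority answer
        "agreement": top_count / len(answers),    # share of raters that agree
        "ranking": counts.most_common(),          # answers ordered by support
    }


# Hypothetical stand-ins for LLM calls, used only to make the sketch runnable.
TOY_REWRITERS: List[Rewriter] = [
    lambda q: q.upper(),
    lambda q: "Please answer: " + q,
    lambda q: q.replace("capital", "capital city"),
]


def toy_answerer(question: str) -> str:
    # Deliberately inconsistent on one phrasing to simulate a disagreement.
    return "Lyon" if question.startswith("Please") else "Paris"


result = run_multiqa_sketch(
    "What is the capital of France?", TOY_REWRITERS, toy_answerer
)
print(result["answer"], result["agreement"])
```

With three rewriters ($v = 3$), the sketch produces four questions and four answers; one toy answer disagrees, so the agreement score falls below 1, which is the kind of disagreement the metrics are meant to surface.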