MultiQ&A: An Analysis in Measuring Robustness via Automated Crowdsourcing of Question Perturbations and Answers
Nicole Cho, William Watson
TL;DR
MultiQ&A addresses the problem of LLM hallucinations and opaque reasoning by introducing a crowdsourced robustness evaluator that perturbs a single question $q_0$ into $v+1$ variants and collects $v+1$ independent answers. The framework, built from a Query Rewriter, Answer Generator, and Aggregator, analyzes robustness through semantic clustering, re-ranking, and a mix of supervised and unsupervised metrics, and is demonstrated on 12 QA datasets with $1.9$ million perturbations and $2.3$ million answers. Across extractive, multiple-choice (MC), and abstractive tasks, GPT-3.5-Turbo is relatively robust under perturbations, with notable exceptions such as MathQA. The approach emphasizes uncertainty, agreement, and reliability metrics to reveal disagreements and potential hallucinations. It offers a scalable, interpretable tool for institutional LLM adoption, enabling measurement of confidence, consistency, and the propensity for hallucinations under adversarial input variations.
Abstract
One critical challenge in the institutional adoption of Large Language Models (LLMs) stems from their propensity to hallucinate in generated responses. To address this, we propose MultiQ&A, a systematic approach for evaluating the robustness and consistency of LLM-generated answers. We demonstrate MultiQ&A's ability to crowdsource question perturbations and their respective answers through independent LLM agents at scale. Our experiments examined 1.9 million question perturbations and 2.3 million answers. Furthermore, MultiQ&A shows that ensembled LLMs, such as gpt-3.5-turbo, remain relatively robust and consistent under perturbations. MultiQ&A provides clarity in the response generation space, offering an effective method for inspecting disagreements and variability. Our system therefore offers a potential framework for institutional LLM adoption, with the ability to measure confidence and consistency and to quantify hallucinations.
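As a rough illustration of the perturb, answer, and aggregate loop described above, the following minimal sketch may help. It is not the authors' implementation: the function names (`perturb`, `aggregate`, `multi_qa`) and the toy rewriter and answerer are hypothetical stand-ins for LLM calls, the $v+1$ questions are assumed to be $q_0$ plus $v$ rewrites, and exact matching on normalized strings is used as a simple substitute for semantic clustering.

```python
from collections import Counter
from math import log2
from typing import Callable, Dict, List


def perturb(q0: str, v: int, rewrite: Callable[[str, int], str]) -> List[str]:
    """Return q0 plus v rewritten variants (v + 1 questions in total)."""
    return [q0] + [rewrite(q0, i) for i in range(1, v + 1)]


def aggregate(answers: List[str]) -> Dict[str, object]:
    """Group answers by normalized string and summarize their agreement.

    Assumption: exact match after lowercasing stands in for semantic clustering.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    top, top_count = counts.most_common(1)[0]
    n = len(normalized)
    # Shannon entropy (bits) of the answer distribution; 0.0 means full agreement.
    entropy = sum(-(c / n) * log2(c / n) for c in counts.values())
    return {"answer": top, "agreement": top_count / n, "entropy": entropy}


def multi_qa(
    q0: str,
    v: int,
    rewrite: Callable[[str, int], str],
    answer: Callable[[str], str],
) -> Dict[str, object]:
    """Perturb q0, answer every variant independently, then aggregate."""
    questions = perturb(q0, v, rewrite)
    answers = [answer(q) for q in questions]
    return aggregate(answers)


# Hypothetical deterministic stand-ins for the rewriter and answerer LLM agents.
def toy_rewrite(q: str, i: int) -> str:
    return f"{q} (variant {i})"


def toy_answer(q: str) -> str:
    return "Lyon" if "variant 3" in q else "Paris"


result = multi_qa("What is the capital of France?", 3, toy_rewrite, toy_answer)
print(result)
```

In this toy run three of the four answers agree, so the agreement rate is 0.75 and the entropy is nonzero; a fully consistent model would give an agreement of 1.0 and an entropy of 0.0.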
