One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations

Yoonjoo Lee, Kihoon Son, Tae Soo Kim, Jisu Kim, John Joon Young Chung, Eytan Adar, Juho Kim

TL;DR

This paper investigates how presenting multiple, potentially inconsistent LLM outputs affects user perception of AI capacity and information comprehension. It identifies five inconsistency types and conducts a randomized experiment (N=252) varying the number of passages (one, two, three) and model accuracy. The findings show that inconsistency lowers perceived AI capacity but can improve comprehension, especially with two passages, while higher model accuracy and cognitive-load considerations modulate these effects. The authors propose design guidelines to transparently reveal model limitations and promote critical usage, including tailoring output counts and highlighting differences between passages to aid sensemaking.

Abstract

As Large Language Models (LLMs) are nondeterministic, the same input can generate different outputs, some of which may be incorrect or hallucinated. If run again, the LLM may correct itself and produce the correct answer. Unfortunately, most LLM-powered systems resort to single results which, correct or not, users accept. Having the LLM produce multiple outputs may help identify disagreements or alternatives. However, it is not obvious how the user will interpret conflicts or inconsistencies. To this end, we investigate how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs. Through a preliminary study, we identified five types of output inconsistencies. Based on these categories, we conducted a study (N=252) in which participants were given one or more LLM-generated passages to an information-seeking question. We found that inconsistency within multiple LLM-generated outputs lowered the participants' perceived AI capacity, while also increasing their comprehension of the given information. Specifically, we observed that this positive effect of inconsistencies was most significant for participants who read two passages, compared to those who read three. Based on these findings, we present design implications that, instead of regarding LLM output inconsistencies as a drawback, we can reveal the potential inconsistencies to transparently indicate the limitations of these models and promote critical LLM usage.

Paper Structure (34 sections, 6 figures, 3 tables)

This paper contains 34 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overall procedure of the experiment. (a) After an introduction to the experiment and task, (b) participants answer the information-seeking question before receiving the passages (i.e., the pre-task questionnaire) and rate their confidence. (c) Participants read the AI-generated passage(s) and (d) answer three comprehension questions. (e) Participants respond to the post-task questionnaire.
  • Figure 2: Mean values and 95% confidence intervals for participants' perceived AI capacity and comprehension scores according to whether they received (a) consistent or inconsistent passages, and (b) their experimental condition (i.e., number of passages).
  • Figure 3: Mean values and 95% confidence intervals for (a) perceived AI capacity and (b) comprehension scores for each subcondition. The lines above the x-axis labels indicate that these subconditions have the same ratio of correct passages (e.g., [xxo], [xox], and [oxx] provide passages where one-third of them have the correct information).
  • Figure 4: (a) Comparison of mean perceived AI capacity and 95% intervals across different levels of AI model accuracy. (b) Same visualization for comprehension scores.
  • Figure 5: (a) Coefficients of the Double and Triple conditions on perceived AI capacity across accuracy levels for the AI model (50%-98%, step of 2%). The shaded region represents the 95% confidence intervals. A cross at a 10% step of accuracy indicates that the coefficient was statistically significant (p<.05). (b) Same visualization for comprehension scores.
  • ...and 1 more figures
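
Several of the captions above report mean values with 95% confidence intervals. As a rough illustration of what such an interval summarizes, the sketch below computes a mean and a normal-approximation 95% interval for one group of scores. The function name `mean_ci95`, the 1.96 multiplier, and the example scores are assumptions made for illustration; the listing above does not state how the intervals in the paper were computed, and the numbers are not data from the study.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) for a normal-approximation 95% interval.

    Hypothetical helper: the paper may use a different method
    (e.g., a t-distribution or bootstrap interval).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = 1.96 * sem  # z-value for a two-sided 95% interval
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Made-up comprehension scores (number of correct answers out of three).
    example_scores = [2, 3, 1, 2, 3, 2, 1, 3]
    mean, low, high = mean_ci95(example_scores)
    print(f"mean={mean:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

With small groups, a t-based multiplier would give a somewhat wider interval than the 1.96 used here.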