The Impact of Revealing Large Language Model Stochasticity on Trust, Reliability, and Anthropomorphization

Chelse Swoopes, Tyler Holloway, Elena L. Glassman

TL;DR

This work investigates whether showing multiple simultaneous LLM responses, coupled with a cognitive-support mechanism highlighting shared structure and semantics, can mitigate the undue anthropomorphization and overtrust that stem from traditional single-response interfaces. Using a within-subjects design with three conditions and measures of workload, trust, and anthropomorphism, the study finds no significant quantitative effects but uncovers rich qualitative themes about cross-verification, perceived model depth, and how response variability shapes user perceptions. The findings suggest that multi-response displays can illuminate the probabilistic nature of LLMs and, particularly when paired with well-designed cognitive supports, offer design opportunities for calibrating user trust and reliance. The work points to a nuanced design space for future interfaces, including trait-based analyses, refined cognitive-load metrics, and exploration of different question types and response consistencies to improve trustworthy human-AI interaction in practical settings.

Abstract

Interfaces for interacting with large language models (LLMs) are often designed to mimic human conversations, typically presenting a single response to user queries. This design choice can obscure the probabilistic and predictive nature of these models, potentially fostering undue trust and over-anthropomorphization of the underlying model. In this paper, we investigate (i) the effect of displaying multiple responses simultaneously as a countermeasure to these issues, and (ii) how a cognitive support mechanism, which highlights structural and semantic similarities across responses, helps users deal with the increased cognitive load of that intervention. We conducted a within-subjects study in which participants inspected responses generated by an LLM under three conditions: one response, ten responses with cognitive support, and ten responses without cognitive support. Participants then answered questions about workload, trust and reliance, and anthropomorphization. We conclude by reporting the results of these studies and discussing future work and design opportunities for future LLM interfaces.

Paper Structure

This paper contains 37 sections and 9 figures.

Figures (9)

  • Figure 1: Single response
  • Figure 2: Ten responses without cognitive support
  • Figure 3: Ten responses with cognitive support
  • Figure 4: IDAQ score distribution across participants (score range: 0-10, with 10 corresponding to greatest tendency to anthropomorphize).
  • Figure 5: Average Need for Cognition grouped by question type
  • ...and 4 more figures