Believing Anthropomorphism: Examining the Role of Anthropomorphic Cues on Trust in Large Language Models

Michelle Cohn, Mahima Pushkarna, Gbolahan O. Olanubi, Joseph M. Moran, Daniel Padgett, Zion Mengesha, Courtney Heldreth

TL;DR

This study investigates how anthropomorphic cues in interactions with large language models shape user trust. Using a 2x2 design crossing Modality (text only vs. speech + text) and Grammatical Person ("I" vs. "the system") with 2,165 US participants over 20 trials, it assesses trial-level and post-trial measures of anthropomorphism and trust. The results show that a computer-generated voice increases perceived anthropomorphism and perceived information accuracy, while the first-person pronoun affects accuracy and risk ratings only in a specific context. Neither cue uniformly drives overall trust, though higher anthropomorphism scores predict higher trust. The findings inform responsible UX design by highlighting when and how to deploy anthropomorphic cues and by recommending uncertainty cues and careful pronoun use to balance trust and accuracy.

Abstract

People now regularly interface with Large Language Models (LLMs) via speech and text interfaces (e.g., Bard). However, little is known about the relationship between how users anthropomorphize an LLM system (i.e., ascribe human-like characteristics to a system) and how they trust the information the system provides. Participants (n=2,165; aged 18-90; from the United States) completed an online experiment in which they interacted with a pseudo-LLM whose responses varied in modality (text only, speech + text) and grammatical person ("I" vs. "the system"). Results showed that the "speech + text" condition led to higher anthropomorphism of the system overall, as well as higher ratings of the accuracy of the information the system provides. Additionally, the first-person pronoun ("I") led to higher information accuracy ratings and lower risk ratings, but only in one context. We discuss the implications of these findings for the design of responsible, human-generative AI experiences.

Paper Structure (33 sections, 5 figures, 5 tables)

This paper contains 33 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Procedure for the current study. All participants first completed a technical qualifier and a comprehension assessment question. (a) System Assignment: All qualifying participants were randomly assigned to one of the four system conditions in our experiment, with assignment balanced for even distribution. (b) Experimental trials: Participants were asked to complete 20 trials. After each experimental trial, participants rated accuracy, risk, and validation. (c) Post-experimental trial assessments: After completing all experimental trials, participants rated their overall perceptions of the system, including a trustworthiness rating, an Anthropomorphism questionnaire, and a qualitative assessment. In this study, we present the results of the anthropomorphism and trust assessments.
  • Figure 2: Mean anthropomorphism score (out of a total of 25) across Modality (speech + text = orange, text only = dark blue) and Grammatical Person (“I found”, “the system found”). Error bars indicate standard error of the mean.
  • Figure 3: Mean ratings for “How accurate is the system’s response?” coded as numeric data (0 = “Not at all accurate”, 1 = “Somewhat accurate”, 2 = “Mostly accurate”, 3 = “Completely accurate”) across question contexts. Note that analyses were conducted on the Likert ordinal data and the numeric data is used for visualization purposes only. The experimental conditions included 1) Modality (speech + text = orange, text only = dark blue), and 2) Grammatical Person (“I found”, “the system found”). Error bars indicate standard error of the mean.
  • Figure 4: First, participants were asked to click an “Ask” button to query the system. After 100ms, a “submit” sound confirmed that the question had been asked. Following a 500ms delay, the question was presented in a speech bubble indicating a user-asked question. The system’s response was buffered with a progress cue, in which the user was shown “...” in the system’s speech bubble for 1500ms. After 2 seconds, the system’s response appeared in the system’s speech bubble. In the third-person condition, the response started with “Here’s what the system found”. In the first-person condition, the response started with “Here’s what I found”. In the text-only condition, participants saw the system’s typed response only. In the speech + text condition, they saw the typed response and heard a TTS voice reading the same response aloud. Note that there was a delay between displaying the system’s response and showing the three response options, where participants rated 1) perceived accuracy of the information, 2) perceived risk, and 3) follow-up validation.
  • Figure 5: Relationship between Trustworthiness Rating and Anthropomorphism score. Individual points represent group means at each score. (-2 = Completely untrustworthy, -1 = Somewhat untrustworthy, 0 = Neither untrustworthy or trustworthy, 1 = Somewhat trustworthy, 2 = Completely trustworthy). Error bars indicate standard error of the mean.
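
The numeric coding and error bars described in the Figure 3 caption can be illustrated with a minimal Python sketch. It assumes the standard error of the mean is computed as the sample standard deviation divided by the square root of the number of ratings; the captions do not state the exact formula. The function name `mean_and_sem` and the example ratings are hypothetical, invented for illustration, and are not study data. As the caption notes, the authors ran their analyses on the ordinal Likert data, so this coding applies to visualization only.

```python
import math
from statistics import mean, stdev

# Numeric codes taken from the Figure 3 caption (used for visualization only).
ACCURACY_CODES = {
    "Not at all accurate": 0,
    "Somewhat accurate": 1,
    "Mostly accurate": 2,
    "Completely accurate": 3,
}


def mean_and_sem(labels):
    """Return (mean, standard error of the mean) for a list of Likert labels.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    scores = [ACCURACY_CODES[label] for label in labels]
    if len(scores) < 2:
        # The sample standard deviation is undefined for fewer than two ratings.
        return mean(scores), float("nan")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Hypothetical ratings for one condition, invented for illustration only.
example_ratings = [
    "Mostly accurate",
    "Completely accurate",
    "Somewhat accurate",
    "Mostly accurate",
]
condition_mean, condition_sem = mean_and_sem(example_ratings)
print(f"mean = {condition_mean:.2f}, SEM = {condition_sem:.3f}")
```

Applying the same function to each Modality by Grammatical Person cell would give the per-condition means and error bars that a plot like Figure 3 displays.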