Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance

Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, Maarten Sap

TL;DR

Rel-A.I. proposes an interaction-centered framework to evaluate human reliance on LLM outputs, arguing that traditional calibration of uncertainty fails to capture real-world risks in human-LM interactions. The method relies on a self-incentivized trivia task, publicly sourced epistemic markers, and meta-level perception questions to measure in-situ reliance across varied interaction contexts. Across three experiments, the study shows that greetings, prior interaction histories, and domain matter significantly for reliance, with competence perceptions often driving the effect, and computational domains showing higher reliance than non-computational ones. The work highlights the need to shift evaluation away from language quality toward context-aware measures of human reliance to better anticipate safety risks and informs design choices for deploying LLMs in real-world settings.

Abstract

The ability to communicate uncertainty, risk, and limitation is crucial for the safety of large language models. However, current evaluations of these abilities rely on simple calibration, asking whether the language generated by the model matches appropriate probabilities. Instead, evaluation of this aspect of LLM communication should focus on the behaviors of their human interlocutors: how much do they rely on what the LLM says? Here we introduce an interaction-centered evaluation framework called Rel-A.I. (pronounced "rely") that measures whether humans rely on LLM generations. We use this framework to study how reliance is affected by contextual features of the interaction (e.g., the knowledge domain that is being discussed) or the use of greetings communicating warmth or competence (e.g., "I'm happy to help!"). We find that contextual characteristics significantly affect human reliance behavior. For example, people rely 10% more on LMs when responding to questions involving calculations and rely 30% more on LMs that are perceived as more competent. Our results show that calibration and language quality alone are insufficient for evaluating the risks of human-LM interactions, and illustrate the need to consider features of the interactional context.
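
As a rough illustration of what comparing reliance across interaction contexts could look like, the sketch below computes a per-condition reliance rate as the fraction of trials in which a participant relied on the LM's answer. This is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `reliance_rates`, the condition labels, and the toy data are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def reliance_rates(trials):
    """Return, per condition, the fraction of trials where the participant relied on the LM.

    `trials` is an iterable of (condition, relied) pairs, where `relied` is a bool.
    This data layout is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [times relied, total trials]
    for condition, relied in trials:
        counts[condition][0] += int(relied)
        counts[condition][1] += 1
    return {cond: relied / total for cond, (relied, total) in counts.items()}


# Hypothetical toy data: not results from the paper.
trials = [
    ("control", True), ("control", False), ("control", False), ("control", True),
    ("greeting", True), ("greeting", True), ("greeting", False), ("greeting", True),
]

rates = reliance_rates(trials)
# Absolute difference in reliance rate between the two hypothetical conditions.
difference = rates["greeting"] - rates["control"]
print(rates, difference)
```

A difference in rates like this is only a descriptive summary; whether an effect is significant would depend on the statistical analysis reported in the paper itself.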

Paper Structure (36 sections, 7 figures, 11 tables)

This paper contains 36 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: We introduce Rel-A.I., an interaction-centered approach to evaluating LLM risks based on human reliance behaviors. We study the effects of interactional contexts on reliance and find that reliance isn't solely contingent on the quality of the model's answer, and that contextual cues such as warmth weigh heavily on human perception of model competence and reliability.
  • Figure 2: The Rel-A.I. approach consists of three components. The self-incentivized task provides participants with an interactive, game-like setting in which to engage with the agent. Epistemic markers and interaction cues are altered based on the experimental setting. Meta-level perception responses are collected from participants.
  • Figure 3: Users are clustered based on their change in perception between $A_{control}$ and $A_{exper}$, and we observe the reliance rates in each cluster. We see that changes to perceptions of competence, warmth, and humanlikeness are strongly correlated with changes to reliance rates.
  • Figure 4: Reliance on expressions from $B_{conf}$ versus from $B_{unconf}$ (not confident). The less frequently the expressions were relied on in $B_{conf}$, the greater the difference between $B_{unconf}$ and $B_{conf}$.
  • Figure 5: Task Consent Form
  • ...and 2 more figures