Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance
Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, Maarten Sap
TL;DR
Rel-A.I. proposes an interaction-centered framework to evaluate human reliance on LLM outputs, arguing that traditional calibration of uncertainty fails to capture real-world risks in human-LM interactions. The method relies on a self-incentivized trivia task, publicly sourced epistemic markers, and meta-level perception questions to measure in-situ reliance across varied interaction contexts. Across three experiments, the study shows that greetings, prior interaction histories, and domain matter significantly for reliance, with competence perceptions often driving the effect, and computational domains showing higher reliance than non-computational ones. The work highlights the need to shift evaluation away from language quality toward context-aware measures of human reliance to better anticipate safety risks and informs design choices for deploying LLMs in real-world settings.
Abstract
The ability to communicate uncertainty, risk, and limitation is crucial for the safety of large language models. However, current evaluations of these abilities rely on simple calibration, asking whether the language generated by the model matches appropriate probabilities. Instead, evaluation of this aspect of LLM communication should focus on the behaviors of their human interlocutors: how much do they rely on what the LLM says? Here we introduce an interaction-centered evaluation framework called Rel-A.I. (pronounced "rely") that measures whether humans rely on LLM generations. We use this framework to study how reliance is affected by contextual features of the interaction (e.g., the knowledge domain that is being discussed) or by the use of greetings communicating warmth or competence (e.g., "I'm happy to help!"). We find that contextual characteristics significantly affect human reliance behavior. For example, people rely 10% more on LMs when responding to questions involving calculations and rely 30% more on LMs that are perceived as more competent. Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions, and illustrate the need to consider features of the interactional context.
