What Lies Beneath? Exploring the Impact of Underlying AI Model Updates in AI-Infused Systems

Vikram Mohanty; Jude Lim; Kurt Luther

TL;DR

This work investigates how users perceive and respond to underlying AI model updates in AI-infused systems, focusing on facial recognition for historical photo identification. Through a controlled online experiment and a diary study of a real-world deployment on Civil War Photo Sleuth (CWPS), the authors find that users struggle to notice model changes and rely heavily on perceived accuracy rather than objective cues such as latency or result counts. Although newer models can improve technical metrics (precision/recall), this does not reliably translate into improved human-AI team performance, and users develop varied folk theories about model behavior. The findings underscore the need for granular, user-centered communication about model updates and suggest strategies to better align user expectations and workflows with evolving system capabilities across domains.

Abstract

AI models are constantly evolving, with new versions released frequently. Human-AI interaction guidelines encourage notifying users about changes in model capabilities, ideally supported by thorough benchmarking. However, as AI systems integrate into domain-specific workflows, exhaustive benchmarking can become impractical, often resulting in silent or minimally communicated updates. This raises critical questions: Can users notice these updates? What cues do they rely on to distinguish between models? How do such changes affect their behavior and task performance? We address these questions through two studies in the context of facial recognition for historical photo identification: an online experiment examining users' ability to detect model updates, followed by a diary study exploring perceptions in a real-world deployment. Our findings highlight challenges in noticing AI model updates, their impact on downstream user behavior and performance, and how they lead users to develop divergent folk theories. Drawing on these insights, we discuss strategies for effectively communicating model updates in AI-infused systems.

Paper Structure (49 sections, 4 figures, 5 tables)

Introduction
Related Work
User Frustration with Software Updates
User Perceptions of Dynamic AI Systems
Civil War Photo Sleuth and Historical Person Identification
Study 1: Distinguishing AI model updates without explicit communication
Hypotheses
Experiment Setup
Facial Recognition Models
Latency in retrieving results
Interface
Dataset
Participants
Measurement
Analysis
...and 34 more sections

Figures (4)

Figure 6: Participants perceived the new model to be more accurate compared to the old model despite moderate absolute accuracy ratings. The mean is denoted by the red dots in the boxplots.
Figure 7: Factors influencing perceived accuracy. The plot shows effect size estimates (in units of perceived accuracy) for various factors. Facial match responses ($\beta = 23.54$, meaning that each additional facial match response increases perceived accuracy by 23.54 points) and replica responses ($\beta = 16.68$) had the largest positive impact on perceived accuracy. In contrast, different person responses decreased perceived accuracy ($\beta = -7.08$, meaning that each additional response of this type reduces perceived accuracy by 7.08 points). Scrolling through more search results ($\beta = -4.75$) and time spent on the search page ($\beta = -6.29$) negatively impacted perceptions, while the number of search results retrieved had a small positive effect ($\beta = 5.97$). Stars next to effect sizes denote statistical significance, with *** indicating $p < 0.001$, ** indicating $p < 0.01$, and * indicating $p < 0.05$.
Figure 8: Comparison of search results retrieved by the old model (a) and the new model (b). The old model retrieved a significantly larger number of results (598) compared to the new model (28). In addition to fewer results, the new model presents different people, suggesting improvements in the relevance of the results retrieved.
Figure 9: (a) Model Preferences of Users and (b) Usefulness Scores for Old and New Models in the Diary Study.
