Same Performance, Hidden Bias: Evaluating Hypothesis- and Recommendation-Driven AI

Michaela Benk, Tim Miller

Abstract

The HCI community commonly evaluates decision support systems based on whether they improve task performance or promote appropriate user reliance. In this work, we look beyond decision outcomes to examine the process through which users develop decision-making strategies. Through a web-based experiment (N = 290) comparing recommendation-driven and hypothesis-driven interaction designs, and using Signal Detection Theory as a theoretical framework, we show that even when performance remains identical, recommendation-driven designs lower participants' thresholds for sufficient evidence and introduce a "hidden bias" in their judgments, resulting in a shifted distribution of errors. Furthermore, we find that experts are just as susceptible to these systemic shifts as novices. We conclude by advocating for a shift in focus: prioritizing decision processes and the preservation of stable evidence standards over performance and reliance alone.

Paper Structure (16 sections, 3 figures)

Figures (3)

  • Figure 1: Experimental Protocol Overview. The study pipeline comprises a pre-experimental assessment of individual differences (Prior Knowledge, NFC), followed by random assignment into experimental conditions. The core task is split into 10 training and 10 testing trials where behavioral metrics are captured. The session concludes with a post-hoc evaluation of cognitive load and tool engagement.
  • Figure 2: Example of task stimuli, using the OnlyConnect dataset, designed to evaluate participants' ability to determine the presence of a theme.
  • Figure 3: (left) Sensitivity ($d'$) and (right) decision criterion ($c$) across two study phases. While sensitivity improved for all participants, criterion shifts were specific to recommendation-driven designs.
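
The two Signal Detection Theory quantities named in Figure 3 can be illustrated with a minimal sketch. It uses the textbook definitions $d' = z(H) - z(F)$ and $c = -\tfrac{1}{2}\,(z(H) + z(F))$, where $H$ is the hit rate and $F$ the false-alarm rate. The listing above does not state how the authors computed these values, so the function name `sdt_measures`, the log-linear correction, the treatment of "theme present" as the signal, and the trial counts are all assumptions made for illustration, not the paper's method or data.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from raw trial counts.

    Hypothetical helper: "signal" is assumed to mean "theme present".
    """
    # Log-linear correction (an assumption): add 0.5 to each count and 1 to
    # each total so that rates of exactly 0 or 1 never reach inv_cdf.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)            # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # response bias
    return d_prime, criterion


# Two made-up participants, each with 8 of 10 trials correct.
# Participant A errs symmetrically: 1 miss and 1 false alarm.
d_a, c_a = sdt_measures(hits=4, misses=1, false_alarms=1, correct_rejections=4)
# Participant B never misses but raises 2 false alarms.
d_b, c_b = sdt_measures(hits=5, misses=0, false_alarms=2, correct_rejections=3)

accuracy_a = (4 + 4) / 10
accuracy_b = (5 + 3) / 10
```

In this made-up case both participants have the same accuracy, yet participant B's criterion is negative, meaning a lower threshold for responding "theme present". This is the kind of error-distribution shift that accuracy alone would not reveal.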