Emerging Reliance Behaviors in Human-AI Content Grounded Data Generation: The Role of Cognitive Forcing Functions and Hallucinations
Zahra Ashktorab, Qian Pan, Werner Geyer, Michael Desmond, Marina Danilevsky, James M. Johnson, Casey Dugan, Michelle Bachman
TL;DR
This study investigates how hallucinations and Cognitive Forcing Functions (CFFs) affect data quality and reliance patterns in human-AI co-creation of content-grounded data for fine-tuning LLMs in HR/customer-support contexts. Using a mixed between-within design with 34 participants across 8 tasks, the authors manipulate CFF type and presence alongside hallucination presence, and evaluate the resulting data with a rubric covering faithfulness, accuracy, completeness, and AI usage. They find that hallucinations substantially degrade data quality and that CFFs do not reliably mitigate this effect, though CFFs influence how users engage with AI suggestions and give rise to novel reliance behaviors (e.g., appending AI content to correct answers). The authors offer a nuanced view of AI reliance in co-creative tasks, highlight the need for conditional CFF deployment, and propose a taxonomy of reliers along with a practical data-quality rubric for improving AI-assisted data generation. These findings have practical implications for designing data-collection pipelines and evaluation schemes to produce higher-quality fine-tuning data for content-grounded LLMs in organizational settings.
Abstract
We investigate the impact of hallucinations and Cognitive Forcing Functions in human-AI collaborative content-grounded data generation, focusing on the use of Large Language Models (LLMs) to assist in generating high-quality conversational data. Through a study with 34 users who each completed 8 tasks (n=272 completed tasks), we found that hallucinations significantly reduce data quality. While Cognitive Forcing Functions do not always alleviate these effects, their presence influences how users integrate AI responses. Specifically, we observed emerging reliance behaviors, with users often appending AI-generated responses to their correct answers, even when the AI's suggestions conflicted with those answers. This points to a potential drawback of Cognitive Forcing Functions, particularly when AI suggestions are inaccurate. Users who overrelied on AI-generated text produced lower-quality data, emphasizing the nuanced dynamics of overreliance in human-LLM collaboration compared to traditional human-AI decision-making.
