The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead?

Alexander S. Choi, Syeda Sabrina Akter, JP Singh, Antonios Anastasopoulos

TL;DR

This study investigates whether Large Language Models (LLMs) can effectively support or inadvertently bias domain-specific analytical work, focusing on AI policy documents from India. Using a two-stage design (Topic Discovery and Topic Assignment) and a controlled, think-aloud protocol, the authors integrate a modified TopicGPT (via GPT-4 with a 128k context window) with expert annotators to compare outcomes with and without LLM assistance. Key findings show that LLMs cover the majority of human-identified topics and dramatically increase labeling speed, but they can introduce anchoring bias and miss low-prevalence, high-sensitivity topics that humans capture. The work highlights a critical efficiency-versus-bias trade-off in human-LLM collaboration and suggests safeguards and human oversight to preserve analytical depth while leveraging AI for rapid analysis and consistency.

Abstract

Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks, leading researchers to use them for time- and labor-intensive analyses. However, their capability to handle highly specialized and open-ended tasks in domains like policy studies remains in question. This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership. The study, conducted in two stages (Topic Discovery and Topic Assignment), integrates LLMs with expert annotators to observe the impact of LLM suggestions on what is usually human-only analysis. Results indicate that LLM-generated topic lists overlap significantly with human-generated topic lists, with minor hiccups in missing document-specific topics. LLM suggestions may significantly improve task completion speed, but they may also introduce anchoring bias that can affect the depth and nuance of the analysis, raising a critical question about the trade-off between increased efficiency and the risk of biased analysis.

Paper Structure (34 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: An overview of the two stages of our user study. In both stages, we have the annotators read the documents and come up with a relevant topic list with (Treatment) and without (Control) the LLM suggestions. By the end of Stage 1, the annotators agree on a Final Topic List, which we use for our Topic Assignment stage. In Stage 2, all annotators conduct the task of assigning the topics to a separate set of documents with (Treatment) and without (Control) the LLM suggestions.
  • Figure 2: The integration process of the topic lists from annotators in different settings for Stage 1. The Final Topic List (H) has some LLM topic overlaps due to the treatment team choosing to use many of the model-generated topics and definitions. Most importantly, the LLM-generated list does not cover, in any capacity, 5 topics that the control group deemed important.
  • Figure 3: An example of the Label Studio GUI using a mock interview. In order to protect interviewee anonymity, interviews will not be released.
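
The kind of coverage comparison described in the Figure 2 caption (which human-identified topics an LLM-generated list fails to cover) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `topic_coverage`, the topic names, and the exact-match rule after lowercasing. The paper does not state that topics were matched by string comparison, and a real comparison would likely depend on expert judgment about topic definitions.

```python
def topic_coverage(human_topics, llm_topics):
    """Return (coverage, missed) for a human topic list against an LLM topic list.

    Matching is a case-insensitive exact string match, which is a
    simplifying assumption and not the procedure used in the paper.
    """
    def normalize(topic):
        return topic.strip().lower()

    llm_set = {normalize(t) for t in llm_topics}
    # Human topics with no counterpart in the LLM list.
    missed = [t for t in human_topics if normalize(t) not in llm_set]
    covered = len(human_topics) - len(missed)
    coverage = covered / len(human_topics) if human_topics else 0.0
    return coverage, missed


# Hypothetical topic lists, not taken from the study.
human = ["Data Privacy", "Skilling", "Healthcare", "Agriculture"]
llm = ["data privacy", "Healthcare", "Governance"]

coverage, missed = topic_coverage(human, llm)
print(f"coverage={coverage:.2f}, missed={missed}")
```

A high coverage value can still hide the problem the caption points to: the missed topics may be exactly the low-prevalence ones that human annotators consider important, so the `missed` list deserves a manual look regardless of the ratio.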