Comparing Traditional and LLM-based Search for Consumer Choice: A Randomized Experiment

Sofia Eleni Spatharioti, David M. Rothschild, Daniel G. Goldstein, Jake M. Hofman

TL;DR

This study compares traditional web search with LLM-based search in consumer vehicle research to understand efficiency, accuracy, and user experience. It uses two randomized experiments: the first shows that LLM-based search speeds task completion and raises user satisfaction but can degrade accuracy on difficult queries where the LLM misreports cargo space; the second introduces color-coded confidence cues that significantly improve accuracy on challenging tasks by encouraging follow-up verification. The findings suggest that LLM-based search can boost productivity, while communicating uncertainty can mitigate overreliance on erroneous outputs, with implications for designing AI-enabled search tools. The work highlights practical steps for enhancing trust and accuracy in AI-assisted decision making.

Abstract

Recent advances in the development of large language models are rapidly changing how online applications function. LLM-based search tools, for instance, offer a natural language interface that can accommodate complex queries and provide detailed, direct responses. At the same time, there have been concerns about the veracity of the information provided by LLM-based tools due to potential mistakes or fabrications that can arise in algorithmically generated text. In a set of online experiments, we investigate how LLM-based search changes people's behavior relative to traditional search, and what can be done to mitigate overreliance on LLM-based output. Participants in our experiments were asked to solve a series of decision tasks that involved researching and comparing different products, and were randomly assigned to do so with either an LLM-based search tool or a traditional search engine. In our first experiment, we find that participants using the LLM-based tool were able to complete their tasks more quickly, using fewer but more complex queries than those who used traditional search. Moreover, these participants reported a more satisfying experience with the LLM-based search tool. When the information presented by the LLM was reliable, participants using the tool made decisions with a comparable level of accuracy to those using traditional search; however, we observed overreliance on incorrect information when the LLM erred. Our second experiment further investigated this issue by randomly assigning some users to see a simple color-coded highlighting scheme to alert them to potentially incorrect or misleading information in the LLM responses. Overall, we find that this confidence-based highlighting substantially increases the rate at which users spot incorrect information, improving the accuracy of their overall decisions while leaving most other measures unaffected.

Paper Structure (21 sections, 14 figures, 1 table)

Figures (14)

  • Figure 1: Example of the same query “what is the cargo space of a 2020 jeep wrangler” in (left) Bing’s traditional search on May 15, 2023 and (right) Bing’s conversational search on May 15, 2023.
  • Figure 2: Screenshots of the interface for Experiment 1.
  • Figure 3: Experiment 1: Efficiency results
  • Figure 4: Complexity of queries issued by condition and task (Experiment 1). Each point represents an average of the complexity of all of the queries issued by a given participant in a given task.
  • Figure 5: Accuracy by condition (Experiment 1). The first four tasks are routine (comparisons between 8 popular SUV models), whereas the fifth is a comparison selected for which the LLM tends to err. Points represent means and error bars are plus or minus one standard error.
  • ...and 9 more figures