Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses

Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, Chien-Sheng Wu

TL;DR

This work reports a study with 21 participants that compared interactions with answer engines and traditional search engines, identifies 16 answer engine limitations, and proposes 16 answer engine design recommendations linked to 8 metrics.

Abstract

Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines. Answer engines not only retrieve relevant sources to a user query but synthesize answer summaries that cite the sources. To understand these systems' limitations, we first conducted a study with 21 participants, evaluating interactions with answer vs. traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (You.com, Perplexity.ai, BingChat) quantifies common limitations (e.g., frequent hallucination, inaccurate citation) and unique features (e.g., variation in answer confidence), with results mirroring user study insights. We release our Answer Engine Evaluation benchmark (AEE) to facilitate transparent evaluation of LLM-based applications.

Paper Structure

This paper contains 42 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: High-level diagram of the three parts to the 90-minute usability study we conducted, and the work that derives from study findings: design recommendations, and the Answer Engine Evaluation (AEE) framework.
  • Figure 2: Comparison of outputs from [A] Perplexity, which reflects the bias inherent in the question by presenting only a one-sided response, and [B] YouChat, which acknowledges multiple perspectives and so avoids presenting incomplete information.
  • Figure 3: Comparison of outputs from [A] Perplexity, which lacks citations for the points generated, causing confusion on the actual source of each sentence, and [B] Copilot, which effectively indicates the sources for each statement.
  • Figure 4: Results generated by Perplexity [A] and the corresponding sources retrieved [B]. The image illustrates how the model retrieved 8 sources, many of which are duplicates of the same source. Despite this, the model cites them differently, creating an illusion of varied content when it is actually the same.
  • Figure 5: Violin plot showing the distribution of the number of sources hovered over and clicked on by participants for Traditional Search versus Answer Engines.
  • ...and 2 more figures