Can ChatGPT-like Generative Models Guarantee Factual Accuracy? On the Mistakes of New Generation Search Engines
Ruochen Zhao, Xingxuan Li, Yew Ken Chia, Bosheng Ding, Lidong Bing
TL;DR
The paper investigates whether ChatGPT-like models can guarantee factual accuracy in search-enabled conversational systems. It analyzes public demonstrations of Microsoft's new Bing and Google's Bard, categorizes the errors into conflicts with sources, non-existent facts, and missing citations, and compares the transparency of the two systems. The demonstrations contain fabricated numbers, misattributed personal details, incorrect venue data, and other grounding failures; Bing offers more source links, but its answers are still imperfectly grounded. The work emphasizes the need for verifiable grounding, explicit source transparency, and confidence reporting to build trust in AI-assisted search systems.
Abstract
Although large conversational AI models such as OpenAI's ChatGPT have demonstrated great potential, we question whether such models can guarantee factual accuracy. Recently, technology companies such as Microsoft and Google have announced new services which aim to combine search engines with conversational AI. However, we have found numerous mistakes in the public demonstrations that suggest we should not easily trust the factual claims of the AI models. Rather than criticizing specific models or companies, we hope to call on researchers and developers to improve AI models' transparency and factual correctness.
