Shared Imagination: LLMs Hallucinate Alike

Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu

TL;DR

The paper introduces imaginary question answering (IQA) as a probe of cross-model similarity among large language models. It reveals a strong "shared imagination space" in which models from four families (GPT, Claude, Mistral, Llama 3) reliably answer each other's purely fictional questions. Thirteen models are prompted as question models (QMs) to generate imaginary multiple-choice questions, either directly (direct questions, DQs) or from a previously generated context (context questions, CQs), and the answers of possibly different answer models (AMs) are evaluated. Correctness is well above the 25% chance level for four-choice questions: the average correctness rate is about 54% for DQs and about 86% for CQs, with peaks up to 96%, suggesting fundamental commonalities in how these models hallucinate. Extensive analyses show that the phenomenon persists across topics, is partially explained by data characteristics and generation order but not by simple heuristics such as perplexity, and is influenced by factors such as prompt ordering and question length. The work discusses implications for model homogeneity, hallucination detection, and computational creativity, and points to future work on broader model families, mechanistic interpretability, and alternative reasoning prompts.

Abstract

Despite the recent proliferation of large language models (LLMs), their training recipes -- model architecture, pre-training data and optimization algorithm -- are often very similar. This naturally raises the question of the similarity among the resulting models. In this paper, we propose a novel setting, imaginary question answering (IQA), to better understand model similarity. In IQA, we ask one model to generate purely imaginary questions (e.g., on completely made-up concepts in physics) and prompt another model to answer. Surprisingly, despite the total fictionality of these questions, all models can answer each other's questions with remarkable success, suggesting a "shared imagination space" in which these models operate during such hallucinations. We conduct a series of investigations into this phenomenon and discuss implications on model homogeneity, hallucination, and computational creativity.
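
The IQA protocol described above can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' implementation: the names `Question`, `AnswerModel`, and `evaluate` are hypothetical, the answer model is any callable standing in for a real LLM call, and the choice to compute the correctness rate over answered questions only is an assumption made here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Question:
    """Hypothetical container for one imaginary multiple-choice question."""
    text: str
    choices: Sequence[str]  # the four choices as generated by the QM
    correct: int            # index into `choices` that the QM marked correct


# Stand-in for an answer model (AM): given the question text and the
# shuffled choices, return the index of the picked choice, or None to refuse.
AnswerModel = Callable[[str, Sequence[str]], Optional[int]]


def evaluate(questions: Sequence[Question],
             answer_model: AnswerModel,
             seed: int = 0) -> Tuple[float, float]:
    """Return (answering_rate, correctness_rate) for one QM/AM pair.

    Assumption: correctness is measured over answered questions only.
    """
    rng = random.Random(seed)
    answered = 0
    correct = 0
    for q in questions:
        # Shuffle the choices so the AM cannot exploit the QM's ordering.
        order = list(range(len(q.choices)))
        rng.shuffle(order)
        shuffled = [q.choices[i] for i in order]

        pick = answer_model(q.text, shuffled)
        if pick is None:  # the AM refused to answer
            continue
        answered += 1
        # Map the picked position back to the QM's original index.
        if order[pick] == q.correct:
            correct += 1

    answering_rate = answered / len(questions) if questions else 0.0
    correctness_rate = correct / answered if answered else 0.0
    return answering_rate, correctness_rate
```

With four choices, an AM that guesses uniformly at random would be expected to score about 25% under this measure, which is the baseline the reported correctness rates should be read against.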

Paper Structure (36 sections, 2 equations, 19 figures, 12 tables)

Figures (19)

  • Figure 1: Imaginary question answering (IQA). Prompt texts are for illustrative purposes, with exact ones shown in Tab. \ref{tab:dq-prompt}-\ref{tab:answer-prompt}. Top: a question model (QM) is prompted to generate an imaginary multiple-choice question and indicate the correct answer, either directly (left) or based on the previously generated context (right). Bottom left: an answer model (AM) answers the question (with the four choices shuffled), or refuses to answer. Bottom right: we observe non-trivial correctness rate and relatively high answering rate (i.e., low refusal rate), with higher values when AM and QM are the same or from the same model family (shown in Fig. \ref{fig:main-results}), and significantly higher values for context-based questions.
  • Figure 2: The correctness and answering rate on direct and context questions for each pair of question model (QM) and answer model (AM). Each pink rectangle represents one model family. For correctness rate, the top-4 highest performing AMs for each QM are shaded. An enlarged version is reproduced in Fig. \ref{fig:main-results-large} of App. \ref{app:main-result}.
  • Figure 3: Top: question embeddings (computed by text-embedding-3-large) by UMAP, color-coded by topic (left) and question model (right). A triangle marker indicates a DQ, and a circle marker indicates a CQ. Bottom left: average intra-topic cosine similarity between questions generated by different models, with DQs on the lower-left half and CQs on the upper-right half. Bottom right: the color legends for the three plots.
  • Figure 4: Per-topic correctness rate of (a subset of) AMs, their average and human guessing, for direct questions (top) and context questions (bottom). Results for other AMs are similar and presented in Fig. \ref{fig:human-guessing-topic-full} of App. \ref{app:human-guessing}.
  • Figure 5: Fraction of questions whose correct choice is the shortest (in blue), 2nd shortest (in orange), 3rd shortest (in green) and longest (in red), by each QM, for direct questions (left) and context questions (right).
  • ...and 14 more figures
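
The statistic plotted in Figure 5 (how often the correct choice is the shortest, 2nd shortest, 3rd shortest, or longest) can be sketched as follows. This is an illustrative reconstruction only: the function name `correct_choice_length_ranks` is hypothetical, and both measuring length in characters and breaking ties by original choice order are assumptions made here, not details confirmed by the caption.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def correct_choice_length_ranks(
    questions: Sequence[Tuple[Sequence[str], int]],
    num_choices: int = 4,
) -> List[float]:
    """Fraction of questions whose correct choice has each length rank.

    `questions` holds (choices, correct_index) pairs. Rank 0 is the
    shortest choice and rank num_choices - 1 is the longest.
    Assumptions: length is counted in characters, and ties keep the
    original choice order (Python's sorted() is stable).
    """
    counts: Counter = Counter()
    total = 0
    for choices, correct in questions:
        # Indices of the choices, ordered from shortest to longest.
        by_length = sorted(range(len(choices)), key=lambda i: len(choices[i]))
        counts[by_length.index(correct)] += 1
        total += 1
    if total == 0:
        return [0.0] * num_choices
    return [counts[rank] / total for rank in range(num_choices)]
```

If answer length were unrelated to correctness, each rank would be expected to receive roughly a quarter of the questions; a distribution skewed toward one rank would indicate a length cue that an answer model, or a human guesser, could exploit.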