The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger

TL;DR

This paper introduces the System Hallucination Scale (SHS), a lightweight, human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). A comparison with the System Usability Scale (SUS) and the System Causability Scale (SCS) reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.

Abstract

We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
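
The internal-consistency figure reported above is a Cronbach's alpha. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard textbook formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the sample ratings are hypothetical and are not taken from the paper; the authors' actual analysis pipeline and item data are not described here.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table of questionnaire responses.

    responses: list of rows, one row per participant and one numeric
    rating per item (all rows must have the same length).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two participants")
    if any(len(row) != n_items for row in responses):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item (column) across participants.
    item_var_sum = sum(variance(col) for col in zip(*responses))

    # Sample variance of each participant's total score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical 5-point Likert ratings: 4 participants, 3 items.
    sample = [
        [4, 5, 4],
        [2, 3, 2],
        [5, 5, 4],
        [3, 3, 3],
    ]
    print(f"alpha = {cronbach_alpha(sample):.3f}")
```

Values closer to 1 indicate that the items vary together across participants. The sketch assumes all items are scored in the same direction, so any reverse-keyed items would need to be recoded before calling the function.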

Paper Structure

This paper contains 26 sections, 8 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Responses to the question: Were the questions of the System Hallucination Scale (SHS) understandable for the study participants? The majority of respondents (87.2%) indicated that the questions were understandable, with a small fraction (12.8%) reporting minor limitations.
  • Figure 2: Responses to the question: Do you consider the questions of the SHS relevant for evaluating LLM outputs? Most respondents (83.0%) rated the SHS questions as relevant, with 14.9% indicating relevance with limitations, and only 2.1% neutral.
  • Figure 3: Responses to the question: Were the response options (Likert / multiple choice) of the SHS appropriate? Participants overwhelmingly (93.6%) indicated that the response options were suitable for expressing their judgments.
  • Figure 4: Responses to the question: Did you need to further explain the wording or meaning of the SHS questions to participants? Most respondents (66.0%) indicated that no additional explanation was required, while 31.9% reported occasional clarification needs.
  • Figure 5: Responses to the question: The questions regarding demographic information were … Nearly all respondents (97.9%) rated the demographic questions as appropriately sized.
  • ...and 2 more figures