CHAIR -- Classifier of Hallucination as Improver

Ao Sun

TL;DR

CHAIR tackles hallucinations in LLMs by leveraging the history of internal token representations across all model layers. For each token $t$, it computes per-layer logits $s_i(t) = \mathrm{lm\_head}(h_i(t))$ and derives a compact feature set from the sequence $S(t) = \{s_1(t), \ldots, s_L(t)\}$. A lightweight classifier is trained on these features ($\text{Last Score}$, $\text{Mean}$, $\text{Max}$, $\text{Min}$, $\text{Std}$, and $\text{Slope}$, with normalization) to detect hallucinations. Evaluations on TruthfulQA and MMLU show substantial improvements, notably in zero-shot scenarios, and demonstrate cross-dataset generalization when training on one dataset and testing on others. This work highlights the value of internal representations for detecting and potentially mitigating hallucinations, and motivates integrating logit patterns into decoding strategies to improve factuality and coherence.
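The per-layer score $s_i(t)$ can be illustrated with a minimal NumPy sketch. It assumes that the score of a token is the logit that the shared output projection assigns to that token's id at each layer; the summary does not spell this out, so the function name `per_layer_scores`, the array shapes, and the omission of any final layer normalization before the projection are all assumptions made for illustration.

```python
import numpy as np

def per_layer_scores(hidden_states, lm_head_weight, token_id):
    """Project every layer's hidden state through the same output head.

    hidden_states:  (L, d) array, one hidden vector h_i(t) per layer (assumed layout)
    lm_head_weight: (V, d) array, the output projection matrix (assumed layout)
    token_id:       index of the token whose score is traced across layers
    Returns an (L,) array holding s_1(t), ..., s_L(t).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    lm_head_weight = np.asarray(lm_head_weight, dtype=float)
    logits = hidden_states @ lm_head_weight.T  # (L, V): one logit vector per layer
    return logits[:, token_id]                 # keep only the traced token's logit

# Toy usage: 4 layers, hidden size 8, vocabulary of 16 (random stand-in values).
rng = np.random.default_rng(0)
trace = per_layer_scores(rng.normal(size=(4, 8)), rng.normal(size=(16, 8)), token_id=3)
```

In a real model the hidden states would come from the transformer itself rather than random arrays; the sketch only shows how one trace value per layer arises from a single shared head.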

Abstract

In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing the internal logits of every token at each layer. Our method extracts a compact set of features (such as the maximum, minimum, mean, standard deviation, and slope) from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on the TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
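The compact feature set named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slope is taken here as the least-squares linear fit of score against layer index, the standard deviation is the population form, and the normalization step mentioned in the TL;DR is left out because its exact form is not given. The function name `trace_features` is hypothetical.

```python
import numpy as np

def trace_features(scores):
    """Summarize one per-layer score trace s_1..s_L as six scalar features."""
    s = np.asarray(scores, dtype=float)
    layers = np.arange(len(s))
    # Assumption: "slope" is the coefficient of a degree-1 least-squares fit.
    slope = np.polyfit(layers, s, deg=1)[0]
    return {
        "last": float(s[-1]),    # score at the final layer
        "mean": float(s.mean()),
        "max": float(s.max()),
        "min": float(s.min()),
        "std": float(s.std()),   # population standard deviation (ddof=0)
        "slope": float(slope),
    }

# Toy usage on a short, made-up trace.
features = trace_features([0.2, 0.5, 0.4, 0.9])
```

Keeping only six scalars per trace is what makes the downstream classifier so small; the per-layer shape of the trace is reduced to summary statistics plus one trend term.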

Paper Structure (14 sections, 15 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Scores across layers for each answer to two questions from TruthfulQA. Red lines show the score traces (defined in \ref{eq:def:token-history-score}) of incorrect answers; green lines show correct answers. In the left graph, although the green curves consistently have lower scores, their pattern of change differs noticeably from that of the red curves. This suggests that even when correct answers score lower, their variation across layers contrasts distinctly with that of incorrect answers, which is information we can exploit. A similar phenomenon appears in the right graph: even though one of the green curves scores above most of the red ones, its shape closely resembles the other green curves and differs from the red curves. This further demonstrates that correct and incorrect answers exhibit distinctly different score patterns across layers.
  • Figure 2: The structure of CHAIR. Blue and green mark the tokens for choices/answers, and the gray boxes denote the transformer layers in the LLaMA model. The purple rectangles correspond to references to the $lm\_head$ layer, and the yellow rectangle represents the feature extraction module. The classifier processes its inputs to determine whether the output sequence is a hallucination.
  • Figure 3: Impact of training set size on model stability and robustness, given the very limited number of parameters in our CHAIR model. The plot shows the outcomes of 50 trials for each sample size, where each trial randomly selects $k$ examples for training. The y-axis represents the improvement on MC1, and the x-axis shows the number of training examples. Each bar and scatter point represents the variation in performance improvement across the 50 trials for a given sample size. This helps us understand how the amount of training data influences the model’s improvement on MC1 under limited-parameter conditions.