Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT

Guy Bar-Shalom, Fabrizio Frasca, Yaniv Galron, Yftah Ziser, Haggai Maron

TL;DR

This work tackles hallucination detection (HD) in large language models by exploiting their internal activations. It introduces Activation Tensors and a Vision Transformer–inspired backbone (ACT-ViT) with per-LLM adapters, which enable cross-LLM training and zero-shot generalization to unseen datasets. Across 15 LLM–dataset combinations, ACT-ViT consistently outperforms traditional probes and probability-based baselines while delivering near real-time inference and efficient training. The approach transfers to unseen LLMs through lightweight fine-tuning, making it a scalable, practical tool for HD in deployment settings.

Abstract

Detecting hallucinations in Large Language Model-generated text is crucial for their safe deployment. While probing classifiers show promise, they operate on isolated layer-token pairs and are LLM-specific, limiting their effectiveness and hindering cross-LLM applications. In this paper, we introduce a novel approach to address these shortcomings. We build on the natural sequential structure of activation data in both axes (layers $\times$ tokens) and advocate treating full activation tensors akin to images. We design ACT-ViT, a Vision Transformer-inspired model that can be effectively and efficiently applied to activation tensors and supports training on data from multiple LLMs simultaneously. Through comprehensive experiments encompassing diverse LLMs and datasets, we demonstrate that ACT-ViT consistently outperforms traditional probing techniques while remaining extremely efficient for deployment. In particular, we show that our architecture benefits substantially from multi-LLM training, achieves strong zero-shot performance on unseen datasets, and can be transferred effectively to new LLMs through fine-tuning. Full code is available at https://github.com/BarSGuy/ACT-ViT.
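
To make the contrast in the abstract concrete, the following minimal sketch shows what a single layer-token probe observes versus the full activation tensor. The sizes (32 layers, 20 tokens, hidden width 64) and the random values are illustrative assumptions, not figures from the paper, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from the paper).
num_layers, num_tokens, hidden_dim = 32, 20, 64

# Stand-in for hidden states collected from one LLM forward pass:
# one hidden vector per (layer, token) position.
activation_tensor = rng.standard_normal((num_layers, num_tokens, hidden_dim))

# A token probe reads a single cell of the layers x tokens grid,
# here a middle layer at the last token (an arbitrary choice).
probe_layer, probe_token = 16, -1
probe_input = activation_tensor[probe_layer, probe_token]

print(probe_input.shape)        # (64,)
print(activation_tensor.shape)  # (32, 20, 64)

# Fraction of grid cells that one probe observes.
probe_coverage = 1 / (num_layers * num_tokens)
print(probe_coverage)           # 0.0015625
```

Viewed this way, the layers and tokens axes play the role of image height and width, and the hidden dimension plays the role of channels.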

Paper Structure

This paper contains 36 sections, 6 equations, 25 figures, 9 tables, 1 algorithm.

Figures (25)

  • Figure 1: ACT-ViT overview: we extract Activation Tensors from (multiple) LLMs, apply Pooling, and project them to a shared space via per-LLM Linear Adapters. A shared ViT Backbone is then applied. ACT-ViT benefits from training on data from multiple LLMs and can be easily fine-tuned to unseen ones.
  • Figure 2: Test AUC heatmaps across layer–token combinations; the best combinations are boxed.
  • Figure 3: Ablation study on the pooling hyperparameters $(L_p, N_p)$. Probe[$\ast$] AUC is indicated by the dashed green line in the left plot.
  • Figure 4: Zero-shot generalization results across all 15 LLM–dataset combinations, in a "leave-one-dataset-out" setup. Each bar shows the Test AUC score of ACT-ViT, ACT-MLP, and the best probability-based baseline (Best-Probas) on a dataset they were not trained on.
  • Figure 5: Low-data regime results, for Mis-7B over the HotpotQA dataset.
  • ...and 20 more figures
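
The pipeline described in the Figure 1 caption (pooling, per-LLM linear adapters, shared backbone) can be sketched roughly as follows. This is a hedged illustration under several assumptions: pooling is taken to be average pooling to a target grid of size $(L_p, N_p)$, which may differ from the paper's definition; the shared ViT backbone is replaced by a trivial mean-pool plus logistic head so the example stays short; weights are random and untrained; and the names `pool_grid`, `ActSketch`, `llm_a`, and `llm_b` are hypothetical.

```python
import numpy as np

def pool_grid(acts, L_p, N_p):
    """Average-pool a (layers, tokens, dim) tensor to (L_p, N_p, dim).

    Assumes layers >= L_p and tokens >= N_p so that no bin is empty.
    """
    num_layers, num_tokens, _ = acts.shape
    # array_split tolerates sizes that do not divide evenly.
    layer_bins = np.array_split(np.arange(num_layers), L_p)
    token_bins = np.array_split(np.arange(num_tokens), N_p)
    return np.stack([
        np.stack([acts[np.ix_(lb, tb)].mean(axis=(0, 1)) for tb in token_bins])
        for lb in layer_bins
    ])

class ActSketch:
    """Per-LLM adapters feeding one shared scoring module (untrained)."""

    def __init__(self, llm_dims, shared_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One linear adapter per LLM: its hidden width -> shared width.
        self.adapters = {
            name: rng.standard_normal((dim, shared_dim)) / np.sqrt(dim)
            for name, dim in llm_dims.items()
        }
        # Placeholder for the shared backbone (NOT a ViT): a logistic
        # head applied to the mean over all pooled grid cells.
        self.head = rng.standard_normal(shared_dim) / np.sqrt(shared_dim)

    def score(self, llm_name, acts, L_p=4, N_p=5):
        pooled = pool_grid(acts, L_p, N_p)         # (L_p, N_p, d_llm)
        shared = pooled @ self.adapters[llm_name]  # (L_p, N_p, shared_dim)
        logit = shared.mean(axis=(0, 1)) @ self.head
        return float(1.0 / (1.0 + np.exp(-logit)))

# Two hypothetical LLMs with different depths, lengths, and hidden widths.
rng = np.random.default_rng(1)
model = ActSketch({"llm_a": 64, "llm_b": 96}, shared_dim=16)
score_a = model.score("llm_a", rng.standard_normal((32, 20, 64)))
score_b = model.score("llm_b", rng.standard_normal((28, 35, 96)))
print(score_a, score_b)
```

Because only the adapters depend on an LLM's hidden width, and pooling fixes the grid size, everything downstream of the adapters can be shared across LLMs. In this sketch, supporting a new LLM would only require adding one more adapter entry.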