Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Nick Jiang; Anish Kachinthaya; Suzie Petryk; Yossi Gandelsman

TL;DR

This work analyzes hallucinations in vision-language models (VLMs) by applying the logit lens to internal image representations, which reveals whether the model internally registers an object as present. It introduces ProjectAway, a linear editing procedure that orthogonalizes image features against the text embeddings of target objects to erase hallucinated content from captions while preserving performance. The authors demonstrate three applications: hallucination detection, targeted hallucination removal, and zero-shot segmentation. Targeted edits reduce hallucinations by up to 25.7% on COCO2014, and the segmentation results are reported as competitive. By interpreting and editing latent representations, the paper offers a path to more reliable VLMs and new capabilities without external detectors or heavy fine-tuning.

Abstract

We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
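
The linear orthogonalization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `project_away`, the `weight` parameter, and the toy vectors are assumptions made for illustration, and a real VLM would apply the edit to hidden states at a chosen layer rather than to hand-written lists. The sketch only shows the core operation, which is subtracting from each image feature its component along a target text embedding.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_away(image_features, text_embedding, weight=1.0):
    """Subtract from each image feature its component along text_embedding.

    With weight=1.0 (an illustrative default, not taken from the paper),
    every returned feature is exactly orthogonal to text_embedding.
    """
    norm_sq = dot(text_embedding, text_embedding)
    if norm_sq == 0.0:
        raise ValueError("text_embedding must be non-zero")
    edited = []
    for h in image_features:
        # Scalar projection coefficient of h onto the text direction.
        scale = weight * dot(h, text_embedding) / norm_sq
        edited.append([hi - scale * ti for hi, ti in zip(h, text_embedding)])
    return edited


# Toy example: two 3-dimensional "image features" and one "object" direction.
features = [[2.0, 1.0, 0.0], [0.5, -1.0, 3.0]]
target = [1.0, 0.0, 0.0]
edited = project_away(features, target)
```

After the edit, each feature has zero inner product with `target`, while the components in the remaining directions are unchanged. That is the sense in which the edit is targeted: only the direction associated with the object is removed.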

Paper Structure (29 sections, 3 equations, 18 figures, 8 tables, 1 algorithm)

Introduction
Related work
Interpreting Latent Representations in Language Models
Interpreting latent representations in Vision Models
Detecting and reducing VLM hallucinations
Extracting Knowledge from VLMs
Preliminaries
Applying Logit Lens on VLMs
Erasing knowledge from VLMs
Erasing objects from image representations
Removing objects one by one
Mass-removing objects
Ablation Study: mass-removing hallucinations
Applications
Hallucination Detection
...and 14 more sections

Figures (18)

Figure 1: Interpreting VLM internal image representations. (a) Given a VLM, (b) we unembed the latent representations from image embeddings to the vocabulary and classify hallucinations. We remove hallucinations by (c) linearly editing them out of the latent representations.
Figure 2: Comparison of internal confidence in objects present and not present in the image. We examine the internal confidence of COCO objects that exist and do not exist in the image within intermediate VLM image representations. We observe that objects that do not exist in the image have lower internal confidence.
Figure 3: Localizing objects using internal confidence values. We find the probabilities of objects through layers of the language model for every image embedding in LLaVA. We use the highest layer probability per image embedding to localize an object within the image.
Figure 4: ProjectAway
Figure 5: Qualitative results for mass object removal. We present example images and their captions after mass-removing hallucinations (red) with ProjectAway, which effectively removes hallucinations while preserving, and even increasing, correctly detected objects (green).
...and 13 more figures
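
The localization procedure summarized in the Figure 3 caption (take, for every image embedding, the highest probability of an object across layers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names, the toy unembedding matrix, and the 0.5 threshold used to form a mask are all hypothetical, and the sketch omits details such as a final layer normalization that a real logit lens on a VLM would typically apply before unembedding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unembed(hidden, unembedding):
    """Map a hidden vector to vocabulary logits (one row per vocabulary token)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]


def patch_confidence(hidden_states, unembedding, token_id):
    """Per-patch maximum over layers of the probability assigned to token_id.

    hidden_states[layer][patch] is the hidden vector of one image embedding.
    """
    best = [0.0] * len(hidden_states[0])
    for layer in hidden_states:
        for p, hidden in enumerate(layer):
            prob = softmax(unembed(hidden, unembedding))[token_id]
            best[p] = max(best[p], prob)
    return best


# Toy setup: vocabulary of 3 tokens, hidden size 2, 2 layers, 3 image patches.
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [
    [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]],  # earlier layer
    [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]],  # later layer
]
conf = patch_confidence(hidden_states, unembedding, token_id=0)
mask = [c > 0.5 for c in conf]  # hypothetical threshold for a coarse mask
```

In this toy setup only the first patch aligns strongly with token 0 at the later layer, so it is the only patch that passes the threshold. Reshaping such a per-patch mask to the image's patch grid would give a coarse localization map of the kind the caption describes.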
