Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models

Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li

TL;DR

The paper tackles hallucination and cross-modal misalignment in Vision-Language Models by extending Representation Engineering (RepE) to multimodal settings. It introduces a theoretical framework where the principal eigenvector $u_1$ acts as a stable backbone for neural activity across layers, while a shrinking spectral gap allows subdominant eigenvectors to encode distinctions between concepts. Empirically, it validates these ideas on COCO with the IDEFICS2-8B VLM, showing strong alignment between $u_1$ and attention outputs and enabling reading and steering of high-level concepts such as honesty and fairness via LAT analyses. The work offers a principled, interpretable approach to improving robustness, fairness, and transparency in multimodal AI systems and lays groundwork for broader bias mitigation and controllable representation in VLMs.
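
The role of the spectral gap can be illustrated with a toy calculation. The sketch below is not the paper's derivation: it assumes small symmetric matrices (`WIDE_GAP` and `NARROW_GAP`, both hypothetical) in place of real attention matrices, estimates the top two eigenpairs by power iteration with deflation, and measures how closely a vector aligns with $u_1$ after a few repeated applications of the matrix. All function names are illustrative.

```python
import math

def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def power_iteration(A, iters=500):
    """Return (eigenvalue, unit eigenvector) of the dominant eigenpair of symmetric A."""
    v = normalize([1.0 + 0.1 * i for i in range(len(A))])
    for _ in range(iters):
        v = normalize(matvec(A, v))
    eigenvalue = sum(x * y for x, y in zip(v, matvec(A, v)))  # Rayleigh quotient
    return eigenvalue, v

def top_two_eigenpairs(A):
    """Top two eigenpairs of a symmetric matrix via deflation (A - lam1 * u1 u1^T)."""
    n = len(A)
    lam1, u1 = power_iteration(A)
    deflated = [[A[i][j] - lam1 * u1[i] * u1[j] for j in range(n)] for i in range(n)]
    lam2, u2 = power_iteration(deflated)
    return lam1, u1, lam2, u2

def alignment_after(A, v0, u1, steps):
    """Absolute cosine similarity between A^steps v0 and u1."""
    v = normalize(v0)
    for _ in range(steps):
        v = normalize(matvec(A, v))
    return abs(sum(x * y for x, y in zip(v, u1)))

# Assumed toy matrices: both share u1 = [1, 1] / sqrt(2) but differ in spectral gap.
WIDE_GAP = [[2.0, 1.0], [1.0, 2.0]]    # eigenvalues 3 and 1
NARROW_GAP = [[2.0, 0.1], [0.1, 2.0]]  # eigenvalues 2.1 and 1.9

if __name__ == "__main__":
    for name, A in (("wide", WIDE_GAP), ("narrow", NARROW_GAP)):
        lam1, u1, lam2, _ = top_two_eigenpairs(A)
        print(name, "gap:", round(lam1 - lam2, 3),
              "alignment after 5 steps:", round(alignment_after(A, [1.0, 0.0], u1, 5), 3))
```

With the wide gap, five applications are enough to align the vector almost entirely with $u_1$. With the narrow gap, the alignment is noticeably lower because the subdominant component decays slowly. This is consistent with the picture in which a small gap leaves room for subdominant eigenvectors to carry information.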

Abstract

Representation Engineering (RepE) has emerged as a powerful paradigm for enhancing AI transparency by focusing on high-level representations rather than individual neurons or circuits. It has proven effective in improving interpretability and control, showing that representations can emerge, propagate, and shape final model outputs in large language models (LLMs). However, in Vision-Language Models (VLMs), visual input can override factual linguistic knowledge, leading to hallucinated responses that contradict reality. To address this challenge, we make the first attempt to extend RepE to VLMs, analyzing how multimodal representations are preserved and transformed. Building on our findings and drawing inspiration from successful RepE applications, we develop a theoretical framework that explains the stability of neural activity across layers using the principal eigenvector, uncovering the underlying mechanism of RepE. We empirically validate these intrinsic properties, demonstrating their broad applicability and significance. By bridging theoretical insights with empirical validation, this work transforms RepE from a descriptive tool into a structured theoretical framework, opening new directions for improving AI robustness, fairness, and transparency.

Paper Structure

This paper contains 15 sections, 11 equations, 10 figures.

Figures (10)

  • Figure 1: An overview of the general representation engineering pipeline. Given an image, a model generates text conditioned on prompts that emphasize contrasting concepts, which leads to distinct latent space representations. These activations are projected onto a principal eigenvector and further decomposed using Principal Component Analysis (PCA) to extract key concept directions. The resulting projections enable downstream monitoring and control, making model behavior easier to interpret.
  • Figure 2: The cosine similarity of neural activity between adjacent layers. The results show that neural activity becomes increasingly similar across layers.
  • Figure 3: The eigenvalue magnitude across layers. The results show that the spectral gap shrinks across layers.
  • Figure 4: A polar plot of the normalized connections between the principal eigenvector and the attention output; the numbers indicate their cosine similarity. The results show that the attention output, especially in later layers, is very similar to the principal eigenvector of that layer's attention matrix.
  • Figure 5: VLM response to an image of the Golden Gate Bridge with a prompt related to the concept of honesty, along with token-wise honesty scores. Green indicates high honesty, while red represents low honesty.
  • ...and 5 more figures
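
The reading step described in the Figure 1 caption (contrast prompts, then PCA to extract a concept direction) can be sketched on synthetic data. This is a minimal illustration in the spirit of the LAT analyses mentioned above, not the authors' implementation: the activations are random toy vectors with a planted concept direction rather than hidden states from a real VLM, the sign of every other pair difference is flipped so that centering does not cancel the shared direction (a design choice of this sketch), and all names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def first_principal_component(rows, iters=200):
    """Top PCA direction of the rows, found by power iteration on the centered data."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    v = normalize([1.0] * dim)
    for _ in range(iters):
        scores = [dot(r, v) for r in centered]  # X v
        v = normalize([sum(s * r[j] for s, r in zip(scores, centered))
                       for j in range(dim)])    # X^T (X v)
    return v

def reading_vector(pos_acts, neg_acts):
    """Concept direction from paired activations under contrasting prompts."""
    diffs = []
    for i, (p, n) in enumerate(zip(pos_acts, neg_acts)):
        sign = 1.0 if i % 2 == 0 else -1.0  # alternate signs so the mean difference is near zero
        diffs.append([sign * (a - b) for a, b in zip(p, n)])
    v = first_principal_component(diffs)
    # Orient the direction so that positive-concept activations score higher.
    mean_gap = sum(dot(p, v) - dot(n, v) for p, n in zip(pos_acts, neg_acts))
    return v if mean_gap >= 0 else [-x for x in v]

def make_toy_activations(num_pairs=40, seed=0):
    """Synthetic stand-in for hidden states: shared content plus or minus a planted concept."""
    rng = random.Random(seed)
    concept = normalize([1.0, -1.0, 2.0, 0.0, 0.5, 0.0, -1.0, 1.0])
    pos, neg = [], []
    for _ in range(num_pairs):
        base = [rng.gauss(0.0, 1.0) for _ in concept]
        pos.append([b + 3.0 * c + rng.gauss(0.0, 0.1) for b, c in zip(base, concept)])
        neg.append([b - 3.0 * c + rng.gauss(0.0, 0.1) for b, c in zip(base, concept)])
    return concept, pos, neg

if __name__ == "__main__":
    concept, pos, neg = make_toy_activations()
    v = reading_vector(pos, neg)
    print("cosine with planted direction:", round(dot(v, concept), 3))
    print("scores for first pair (pos, neg):", round(dot(pos[0], v), 2), round(dot(neg[0], v), 2))
```

Projecting a new activation onto the returned vector gives a scalar score for the concept. Token-wise scores such as those shown in Figure 5 could in principle be produced by applying the same projection per token, though the exact procedure used in the paper is not stated here.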