Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models

Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun

TL;DR

This work first identifies the importance of a robust and reliable sensor during editing, then proposes an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance.

Abstract

As Large Language Models (LLMs) have developed rapidly and achieved remarkable success, understanding and rectifying their complex internal mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of internal representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.

Paper Structure (16 sections, 6 equations, 3 figures, 6 tables, 1 algorithm)

Figures (3)

  • Figure 1: An illustration of our proposed ARE framework. This example showcases how ARE can enhance the concept of "angry" within an LLM. The process involves an iterative dance between the generator and the discriminator. The generator produces outputs, while the discriminator refines its internal representation of "angry" based on these outputs. Through this back-and-forth training, the LLM gradually learns to produce outputs that align better with the concept of "angry."
  • Figure 2: Comparison between the basic structures of GAN and ARE.
  • Figure 3: t-SNE visualization of the aligned model's responses to normal and malicious prompts over iterative training epochs.
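
The generator/discriminator alternation described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D points stand in for hidden representations, the "generator" is reduced to a single learnable offset `shift`, the discriminator is a small logistic regression, and every name here (`fit_discriminator`, `train_toy_are`, `shift`) is hypothetical. Re-fitting the discriminator from scratch in each round is a simplification assumed here for stability; the source does not state how the discriminator is updated.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_discriminator(real, fake, steps=25, lr=0.1):
    """Fit a logistic-regression discriminator from scratch (real=1, fake=0)."""
    w, b = [0.0, 0.0], 0.0
    data = [(p, 1.0) for p in real] + [(p, 0.0) for p in fake]
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for (x, y), label in data:
            # Gradient of binary cross-entropy with respect to the logit.
            err = sigmoid(w[0] * x + w[1] * y + b) - label
            gw[0] += err * x
            gw[1] += err * y
            gb += err
        w[0] -= lr * gw[0] / len(data)
        w[1] -= lr * gw[1] / len(data)
        b -= lr * gb / len(data)
    return w, b


def train_toy_are(rounds=200, gen_lr=0.5):
    """Alternate discriminator fitting and generator updates on toy 2-D data."""
    offsets = [(-0.2, -0.2), (-0.2, 0.2), (0.2, -0.2), (0.2, 0.2)]
    # Assumed stand-ins: "target concept" representations sit around (2, 2),
    # the unedited model's representations sit around (0, 0).
    target = [(2.0 + dx, 2.0 + dy) for dx, dy in offsets]
    base = list(offsets)
    shift = [0.0, 0.0]  # the only "generator" parameter in this toy

    for _ in range(rounds):
        edited = [(x + shift[0], y + shift[1]) for x, y in base]

        # Discriminator step: learn to tell target from edited representations.
        w, b = fit_discriminator(target, edited)

        # Generator step: gradient descent on -log D(edited) with respect to
        # shift, so edited representations look more like the target concept.
        grad = [0.0, 0.0]
        for x, y in edited:
            p = sigmoid(w[0] * x + w[1] * y + b)
            grad[0] += -(1.0 - p) * w[0]
            grad[1] += -(1.0 - p) * w[1]
        shift[0] -= gen_lr * grad[0] / len(edited)
        shift[1] -= gen_lr * grad[1] / len(edited)
    return shift


if __name__ == "__main__":
    print(train_toy_are())
```

With these toy settings the learned offset should end up close to (2, 2), at which point the edited and target points coincide and the discriminator has nothing left to separate. In a real setting the offset would be replaced by trainable model parameters and the points by representations extracted from the model.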