Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models

Yihao Zhang; Zeming Wei; Jun Sun; Meng Sun

TL;DR

This work first identifies the importance of a robust and reliable sensor during editing, then proposes an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance.

Abstract

Large Language Models (LLMs) have developed rapidly and achieved remarkable success, making it urgent to understand and rectify their complex internal mechanisms. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.

Paper Structure (16 sections, 6 equations, 3 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Notations and Problem Formulation
Proposed Method
Adversarial Representation Engineering
General Conceptual Editing
Experiments
Alignment: To Generate (Harmful Responses) or Not to Generate
Hallucination: To Hallucinate or Not to Hallucinate
Text Generation Quality Issues
Discussion and Conclusion
Experiment details
Details of training the discriminator
Details of hallucination evaluation
Jailbreak Evaluation Results on Larger Models
...and 1 more section

Figures (3)

Figure 1: An illustration of our proposed ARE framework. This example showcases how ARE can enhance the concept of "angry" within an LLM. The process involves an iterative dance between the generator and the discriminator. The generator produces outputs, while the discriminator refines its internal representation of "angry" based on these outputs. Through this back-and-forth training, the LLM gradually learns to produce outputs that align better with the concept of "angry."
Figure 2: Comparison between the basic structures of GAN and ARE.
Figure 3: t-SNE visualization of the aligned model's responses to normal and malicious prompts over iterative training epochs.
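
The alternating generator/discriminator loop described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation: the "representations" here are synthetic 2-D vectors rather than LLM hidden states, the "generator" is reduced to a single learnable shift vector, and the discriminator is a small logistic regression refit from scratch each round. All names (`run_toy_are`, `train_discriminator`, `shift`) and all hyperparameters are hypothetical choices made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Discriminator logit for one representation vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_discriminator(target, generated, lr=0.1, steps=50):
    """Fit a fresh logistic regression: target reps -> 1, generated reps -> 0."""
    dim = len(target[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in target] + [(x, 0.0) for x in generated]
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(score(w, b, x)) - y  # d(cross-entropy)/d(logit)
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def mean_distance(set_a, set_b):
    """Euclidean distance between the mean vectors of two sets."""
    dim = len(set_a[0])
    mean_a = [sum(r[j] for r in set_a) / len(set_a) for j in range(dim)]
    mean_b = [sum(r[j] for r in set_b) / len(set_b) for j in range(dim)]
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(mean_a, mean_b)))


def run_toy_are(rounds=40, gen_lr=1.0, seed=0):
    """Alternate discriminator and generator updates on synthetic data."""
    rng = random.Random(seed)
    dim = 2
    # Assumption: stand-ins for hidden states of the unedited model and of
    # examples that exhibit the target concept (e.g. "angry").
    base = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(40)]
    target = [[rng.gauss(3.0, 0.5) for _ in range(dim)] for _ in range(40)]
    shift = [0.0] * dim  # the only editable "generator" parameter in this toy
    history = []
    for _ in range(rounds):
        generated = [[x + s for x, s in zip(row, shift)] for row in base]
        history.append(mean_distance(generated, target))
        # Discriminator step: learn to tell target reps from generated reps.
        w, b = train_discriminator(target, generated)
        # Generator step: gradient ascent on log D(generated) w.r.t. shift,
        # using d/dz log(sigmoid(z)) = 1 - sigmoid(z).
        grad = [0.0] * dim
        for row in generated:
            p = sigmoid(score(w, b, row))
            for j in range(dim):
                grad[j] += (1.0 - p) * w[j]
        shift = [s + gen_lr * g / len(generated) for s, g in zip(shift, grad)]
    return shift, history


if __name__ == "__main__":
    final_shift, dist = run_toy_are()
    print("distance to target concept, first vs last round:", dist[0], dist[-1])
```

Refitting the discriminator from scratch each round is a simplification chosen here to keep the toy dynamics stable; it should not be read as a statement about how the paper trains its discriminator.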
