
"Why" Has the Least Side Effect on Model Editing

Tsung-Hsuan Pan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

TL;DR

The paper tackles the problem of updating LLM knowledge efficiently through model editing while mitigating unintended side effects. It analyzes how question-type categories, batch size, and model size influence degradation after edits, using MEMIT on RealTimeQA with GPT-2-XL and LLaMA-2-7B. Key findings show that 'Why' questions cause the least degradation, larger batch sizes can delay the second performance drop, and insights from small models do not always transfer to larger ones. These results inform experimental design for knowledge editing and caution against overgeneralizing across model scales, highlighting areas for further mechanism-focused research.

Abstract

Training large language models (LLMs) from scratch is an expensive endeavor, particularly as world knowledge continually evolves. To maintain the relevance and accuracy of LLMs, model editing has emerged as a pivotal research area. While model editing methods hold promise, they can also produce unintended side effects, and the factors and causes behind those side effects remain largely unexplored. This paper examines one critical factor, question type, by categorizing model editing questions. Our findings reveal that the extent of performance degradation varies significantly across question types, providing new insights for experimental design in knowledge editing. Furthermore, we investigate whether insights from smaller models can be extrapolated to larger models. Our results show discrepancies between models of different sizes, suggesting that insights from smaller models may not necessarily apply to larger ones. Additionally, we examine the impact of batch size on side effects and find that increasing the batch size can mitigate performance drops.
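As a rough illustration of the setup the abstract describes, the sketch below groups edit questions by question type and splits each group into fixed-size batches. It is only a minimal sketch under stated assumptions: the category list, the first-interrogative-word heuristic, and the helper names (question_type, group_by_type, batches) are hypothetical and are not taken from the paper, whose actual taxonomy and editing pipeline may differ.

```python
from collections import defaultdict

# Hypothetical category list; the paper's actual question taxonomy may differ.
QUESTION_WORDS = ("why", "what", "which", "who", "when", "where", "how")


def question_type(question: str) -> str:
    """Return the first interrogative word in the question, or 'other'."""
    for token in question.lower().split():
        word = token.strip("\"'.,?!:;()")
        if word in QUESTION_WORDS:
            return word
    return "other"


def group_by_type(questions):
    """Map each question type to the list of questions of that type."""
    groups = defaultdict(list)
    for q in questions:
        groups[question_type(q)].append(q)
    return dict(groups)


def batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    # Made-up example questions, used only to exercise the helpers.
    sample = [
        "Why did the central bank raise rates?",
        "Who won the election?",
        "In which city was the summit held?",
        "Why was the launch delayed?",
    ]
    for qtype, group in group_by_type(sample).items():
        # Each batch would be handed to an editing method in one step.
        print(qtype, batches(group, batch_size=2))
```

Grouping before batching keeps each batch homogeneous in question type, so any degradation measured after a sequence of edits can be attributed to a single category and a single batch size.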

"Why" Has the Least Side Effect on Model Editing

TL;DR

The paper tackles the problem of updating LLM knowledge efficiently through model editing while mitigating unintended side effects. It analyzes how question-type categories, batch size, and model size influence degradation after edits, using MEMIT on RealTimeQA with GPT-2-XL and LLaMA-2-7B. Key findings show that 'Why' questions cause the least degradation, larger batch sizes can delay the second performance drop, and insights from small models do not always transfer to larger ones. These results inform experimental design for knowledge editing and caution against overgeneralizing across model scales, highlighting areas for further mechanism-focused research.

Abstract

Training large language models (LLMs) from scratch is an expensive endeavor, particularly as world knowledge continually evolves. To maintain relevance and accuracy of LLMs, model editing has emerged as a pivotal research area. While these methods hold promise, they can also produce unintended side effects. Their underlying factors and causes remain largely unexplored. This paper delves into a critical factor-question type-by categorizing model editing questions. Our findings reveal that the extent of performance degradation varies significantly across different question types, providing new insights for experimental design in knowledge editing. Furthermore, we investigate whether insights from smaller models can be extrapolated to larger models. Our results indicate discrepancies in findings between models of different sizes, suggesting that insights from smaller models may not necessarily apply to larger models. Additionally, we examine the impact of batch size on side effects, discovering that increasing the batch size can mitigate performance drops.
Paper Structure (11 sections, 2 figures)

Figures (2)

  • Figure 1: Results of LLaMA-2. Note that the y-axis scale differs between charts to allow detailed discussion of each.
  • Figure 2: Results of GPT-2.