Correcting Large Language Model Behavior via Influence Function

Han Zhang; Zhuo Zhang; Yi Zhang; Yuanzhao Zhai; Hanyang Peng; Yu Lei; Yue Yu; Hui Wang; Bin Liang; Lin Gui; Ruifeng Xu

TL;DR

This work tackles the problem of dynamic, evolving human preferences causing outdated training data to misalign LLM behavior. It introduces LANCET, a two-stage framework that first uses LinFAC to identify influential contaminated data via efficient influence-function recall, and then applies Influence-driven Bregman Optimization (IBO) to adjust the model using influence rankings in a post-training setting. Empirically, LANCET reduces unsafe outputs while preserving or even improving usefulness and diversity, outperforming baselines including human-corrected and unlearning methods, and showing strong generalization to unseen prompts. The approach is modular, plug-and-play, and interpretable, offering a practical path to maintaining alignment with evolving human norms without costly human intervention.

Abstract

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an Influence function-driven Bregman Optimization (IBO) technique to adjust the model's behavior based on these influence distributions. Our experiments demonstrate that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs. Furthermore, LANCET can outperform methods that rely on collecting human preferences, and it enhances the interpretability of learning human preferences within LLMs.
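To make the first phase concrete, the sketch below computes the classic influence-function estimate, the negated product of the query-loss gradient, the inverse Hessian, and each training-example gradient, on a toy least-squares model. It is an illustration of the general idea only and is not LANCET's LinFAC approximation, which the abstract does not spell out; the function name `influence_scores`, the `damping` parameter, and the least-squares setting are all assumptions made for this example.

```python
import numpy as np

def influence_scores(X, y, x_query, y_query, damping=1e-3):
    """Toy influence-function scores for damped least squares (illustrative only).

    Score i is -g_query^T H^{-1} g_i: a first-order estimate of how the
    query loss changes when training example i is upweighted.
    """
    n, d = X.shape
    # Hessian of the mean 0.5 * squared-error loss plus an L2 damping term.
    H = X.T @ X / n + damping * np.eye(d)
    # Closed-form minimizer of the damped objective.
    theta = np.linalg.solve(H, X.T @ y / n)
    # Per-example gradients of 0.5 * (x . theta - y)^2 with respect to theta.
    train_grads = (X @ theta - y)[:, None] * X
    query_grad = (x_query @ theta - y_query) * x_query
    # Inverse-Hessian-vector product, computed by a solve instead of an explicit inverse.
    ihvp = np.linalg.solve(H, query_grad)
    return -train_grads @ ihvp

# Hypothetical toy data: 50 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Use the first training example as the query and rank the training set by score.
scores = influence_scores(X, y, X[0], y[0])
ranking = np.argsort(scores)
```

Under this sign convention, a positive score means that upweighting the training example is estimated to increase the query loss, and a negative score means it is estimated to decrease it. For a model the size of an LLM the Hessian cannot be formed or solved exactly as it is here, which is why an efficient approximation is needed in practice.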
