SHA256 at SemEval-2025 Task 4: Selective Amnesia -- Constrained Unlearning for Large Language Models via Knowledge Isolation

Saransh Agrawal, Kuan-Hao Huang

TL;DR

We tackle selective unlearning in large language models by locating the causal loci of memorized facts with causal mediation analysis (CMA) and applying constrained updates to early transformer layers, specifically the MLPs in layers 0–5. The unlearning objective is a joint loss $\mathcal{L}_{\text{joint}} = -\mathcal{L}_{\text{CE}}^{\text{forget}} + \alpha \cdot \mathcal{L}_{\text{CE}}^{\text{retain}}$ with an adaptive weight $\alpha$ guided by $\gamma = a \cdot b^{\Delta L} + c$, clipped to $[\alpha_{\min}, \alpha_{\max}]$ (with $a=0.3$, $b=6$, $c=0.8$, $\alpha_{\min}=1.2$, $\alpha_{\max}=2.8$). Experiments on OLMo 1B and 7B show that early layers store subject-attribute associations, enabling effective forgetting with limited impact on general ability: the 1B model achieves a final score of $0.652$ (TA $0.973$), with the forget set at $0.14$, the retain set at $0.94$, and MIA at $0.741$; MMLU drops by $\sim22\%$ to $0.24$. The 7B model attains a TA of $0.964$ but suffers a larger MMLU degradation of $46\%$, highlighting scalability considerations. These results support causal-informed layer isolation as a practical route to efficient, targeted unlearning that preserves essential capabilities and advances privacy compliance without full retraining.
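The adaptive weighting described above can be illustrated with a minimal sketch. It assumes that $\alpha$ is simply $\gamma$ clipped to $[\alpha_{\min}, \alpha_{\max}]$ and that $\Delta L$ is the change in retain loss supplied by the caller; the function names `adaptive_alpha` and `joint_loss` are hypothetical and do not come from the paper.

```python
def adaptive_alpha(delta_l, a=0.3, b=6.0, c=0.8, alpha_min=1.2, alpha_max=2.8):
    """Retain-loss weight as a clipped exponential of delta_l.

    Assumption: alpha equals gamma = a * b**delta_l + c, clipped to
    [alpha_min, alpha_max]. Defaults are the values quoted in the TL;DR.
    """
    gamma = a * b ** delta_l + c
    return min(max(gamma, alpha_min), alpha_max)


def joint_loss(forget_ce, retain_ce, alpha):
    """Joint objective: -CE(forget) + alpha * CE(retain).

    Minimizing it pushes the forget-set cross-entropy up while
    holding the retain-set cross-entropy down.
    """
    return -forget_ce + alpha * retain_ce


if __name__ == "__main__":
    # Hypothetical loss values, for illustration only.
    for delta_l in (0.0, 0.5, 1.0, 2.0):
        alpha = adaptive_alpha(delta_l)
        print(delta_l, alpha, joint_loss(2.0, 1.0, alpha))
```

Under these assumptions the weight stays at the floor $\alpha_{\min}$ while the retain loss is unchanged and saturates at $\alpha_{\max}$ once $\Delta L$ grows large, so the retain term dominates whenever retained knowledge starts to degrade.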

Abstract

Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0–5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers, simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.
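The layer-freezing step can be sketched as a filter over parameter names. This is only an illustration under stated assumptions: the naming scheme `...layers.<index>.mlp...` is hypothetical (actual OLMo parameter names may differ), and restricting updates to MLP weights only, rather than all weights in layers 0–5, is an assumption made here, not a detail confirmed by the abstract.

```python
import re

# Hypothetical naming scheme: "<prefix>.layers.<index>.mlp.<rest>".
LAYER_MLP = re.compile(r"\.layers\.(\d+)\.mlp\.")


def is_trainable(param_name, max_layer=5):
    """True only for MLP parameters in layers 0..max_layer."""
    match = LAYER_MLP.search(param_name)
    return match is not None and int(match.group(1)) <= max_layer


def split_parameters(param_names, max_layer=5):
    """Partition names into (trainable, frozen) lists, preserving order."""
    trainable = [n for n in param_names if is_trainable(n, max_layer)]
    frozen = [n for n in param_names if not is_trainable(n, max_layer)]
    return trainable, frozen


if __name__ == "__main__":
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.mlp.up_proj.weight",
        "model.layers.3.self_attn.q_proj.weight",
        "model.layers.5.mlp.down_proj.weight",
        "model.layers.12.mlp.up_proj.weight",
    ]
    trainable, frozen = split_parameters(names)
    print("trainable:", trainable)
    print("frozen:", frozen)
```

In a training framework, the frozen names would have gradient updates disabled so that only the selected early-layer MLP weights receive updates from the joint loss.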

Paper Structure

This paper contains 15 sections, 7 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Impact of restoring hidden states at various token positions on predicting the correct attribute. Top: OLMo 7B. Bottom: OLMo 1B. The x-axis shows the layer number and the y-axis shows the impact of each type of token (i, s, r) on predicting attribute 'a'. The 's' and 'r' tokens are split into first 'f', middle 'm', and last 'l' tokens. Tokens in the same category are averaged.
  • Figure 2: Visualization of the adaptive regularization function $\alpha$, plotted against the change in retain loss $\Delta L$. Over the range of observed $\Delta L$ values, the chosen configuration (solid red) penalizes increases in $\Delta L$ more strongly than the blue or green configurations.