A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

Yan Scholten, Stephan Günnemann, Leo Schwinn

TL;DR

This work argues that deterministic, greedy evaluations inadequately capture the full output distribution and associated risks of large language models, particularly for unlearning and alignment. It introduces a formal probabilistic evaluation framework with high-probability leakage guarantees, comprising distribution-bound metrics and moment-based bounds, plus an expectation-deviation (ED) score. The paper then presents entropy-regularized unlearning and adaptive temperature scaling to reduce leakage when sampling from the model, validated on TOFU and Harry Potter datasets, and demonstrates that probabilistic evaluations reveal leakage that deterministic methods miss. It further shows that probabilistic assessments apply beyond unlearning to alignment, revealing practical risks such as higher toxicity under sampling. Overall, the approach provides a robust, distribution-focused toolkit for evaluating and improving LLM safety and reliability across sensitive applications, with broad potential for extension to other modalities.

Abstract

Comprehensive evaluation of Large Language Models (LLMs) is an open research problem. Existing evaluations rely on deterministic point estimates generated via greedy decoding. However, we find that deterministic evaluations fail to capture the whole output distribution of a model, yielding inaccurate estimations of model capabilities. This is particularly problematic in critical contexts such as unlearning and alignment, where precise model evaluations are crucial. To remedy this, we introduce the first formal probabilistic evaluation framework for LLMs. Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment. Our experimental analysis reveals that deterministic evaluations falsely indicate successful unlearning and alignment, whereas our probabilistic evaluations better capture model capabilities. We show how to overcome challenges associated with probabilistic outputs in a case study on unlearning by introducing (1) a novel loss based on entropy optimization, and (2) adaptive temperature scaling. We demonstrate that our approach significantly enhances unlearning in probabilistic settings on recent benchmarks. Overall, our proposed shift from point estimates to probabilistic evaluations of output distributions represents an important step toward comprehensive evaluations of LLMs. Code available at https://www.cs.cit.tum.de/daml/probabilistic-unlearning/.
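
The gap between greedy and sampled evaluation can be illustrated with a toy sketch. It uses a hypothetical output distribution over two complete answers for a single query; the answers, probabilities, and helper names are illustrative assumptions and are not taken from the paper. Greedy decoding returns only the most likely answer and reports no leak, while repeated sampling exposes the leak that occurs in a minority of generations.

```python
import random

# Hypothetical distribution over complete answers for one forget-set query.
# Both the answers and the probabilities are made up for illustration.
ANSWERS = {"I don't know.": 0.8, "Ron and Hermione.": 0.2}
SECRET = "Ron and Hermione."


def greedy_answer(dist):
    """Deterministic evaluation: return the single most likely answer."""
    return max(dist, key=dist.get)


def sample_answers(dist, n, seed=0):
    """Probabilistic evaluation: draw n answers from the output distribution."""
    rng = random.Random(seed)
    answers = list(dist)
    weights = [dist[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


def leaks(answer):
    """Binary leakage indicator: does the answer reveal the secret?"""
    return answer == SECRET


greedy_leak = leaks(greedy_answer(ANSWERS))
samples = sample_answers(ANSWERS, n=1000)
leak_rate = sum(leaks(a) for a in samples) / len(samples)

print("greedy decoding leaks:", greedy_leak)
print("empirical leak rate under sampling:", leak_rate)
```

In this toy setting a point estimate would declare unlearning successful, although roughly one in five sampled answers still reveals the secret.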

Paper Structure

This paper contains 19 sections, 1 theorem, 14 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

With probability at least $1-\alpha$, metric $M_2(x)$ upper-bounds the probability that the next sample leaks more than a fraction $x$ of the secret: $\Pr(X>x) \leq M_2(x)$ for all $x \in [0,1]$.
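
One standard way to obtain a bound of this form is the one-sided Dvoretzky-Kiefer-Wolfowitz (DKW) inequality applied to the empirical CDF of sampled leakage values. The sketch below assumes i.i.d. leakage scores in $[0,1]$ (for example, ROUGE-L scores of sampled responses). The function name and the choice of DKW are assumptions for illustration and may differ from the paper's exact construction of $M_2$.

```python
import math


def dkw_leakage_bound(leakages, x, alpha=0.05):
    """Upper bound on Pr(X > x) from i.i.d. leakage samples.

    Uses the one-sided DKW inequality: with probability at least 1 - alpha,
    1 - F(x) <= 1 - F_n(x) + eps holds simultaneously for all x, where F_n is
    the empirical CDF and eps = sqrt(ln(1 / alpha) / (2 n)).
    The one-sided form requires alpha <= 0.5.
    """
    n = len(leakages)
    if n == 0:
        raise ValueError("need at least one sample")
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5]")
    empirical_cdf = sum(1 for v in leakages if v <= x) / n
    eps = math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    # A probability bound above 1 carries no information, so clip it.
    return min(1.0, 1.0 - empirical_cdf + eps)


# Hypothetical leakage scores for 1000 sampled responses to one question.
scores = [0.0] * 900 + [0.5] * 80 + [1.0] * 20
for threshold in (0.1, 0.5, 0.9):
    print(threshold, dkw_leakage_bound(scores, threshold))
```

Because the DKW guarantee is uniform in $x$, the same set of samples can be reused to read off the bound at every leakage level without a further correction for multiple thresholds.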

Figures (6)

  • Figure 1: We propose a novel probabilistic evaluation framework as a more reliable method for assessing LLM capabilities. Existing evaluations are deterministic and rely on greedy decoding, where the most likely token is selected at each step, producing only a single output per query. Since in most practical applications LLMs generate outputs probabilistically, previous evaluation schemes are insufficient: They overlook potential information leaks and falsely suggest successful unlearning. In contrast, in our probabilistic evaluation framework we directly consider the LLM's output distribution by sampling from the token probability distribution at each step to generate multiple sequences. In an empirical study, we show that all state-of-the-art unlearning methods leak information under our probabilistic setting, demonstrating that current deterministic evaluations are insufficient.
  • Figure 2: Entropy optimization: In this example the model (1) must unlearn the answer to the question "Who are Harry Potter's best friends?" while retaining the answer to the question "What is the capital of Canada?". While minimizing the unlearning loss (2) ensures that the model forgets the sensitive information, our method minimizes the entropy of the model's output distribution for forget samples (3) and retains it on retain samples (4). This allows us to selectively reduce entropy for unlearning-related queries while maintaining entropy on retain samples, effectively reducing the risk of leaking sensitive information under sampling attacks without compromising diversity.
  • Figure 3: Our results demonstrate that deterministic evaluations fail to detect residual information still contained after unlearning, whereas our probabilistic metrics provide more comprehensive evaluations: (a) Binary leakage bound ($M_{\mathbf{bin}}$) for questions of the Harry Potter Q&A. While greedy decoding indicates successful unlearning, our probabilistic perspective reveals that for 38% of the questions the upper bound on the expected leakage is larger than 10%. (b-c) ROUGE-L score of 1024 generated responses from a single question of the TOFU dataset. The bold dashed line indicates the ROUGE-L score of greedy decoding. The second row contains results for NPO and our proposed unlearning algorithm for a question-answer pair of the TOFU forget set. (d) General leakage bound ($M_{\mathbf{gen}}$) illustrating differences in information leakage between NPO and our approach for different levels of leakage $x$. (e-f) Expectation bound ($M_{\boldsymbol{\mu}}$) and standard deviation bound ($M_{\boldsymbol{\sigma}}$).
  • Figure 4: (a) Effect of forget entropy regularization weight $\lambda_f$ on the standard deviation of the leakage distribution. Stronger regularization decreases the probability of leaking information. (b) Decreasing the temperature $\tau$ also decreases model leakage, but at the cost of lower output diversity.
  • Figure 5: Ablation studies for our proposed entropy optimization approach: (a) Negative effects on output diversity can be mitigated through a negatively weighted ($\lambda_r$) entropy loss. (b) Token confidence on the forget set considerably increases during training, remaining largely the same on the retain set. This allows us to decrease information leakage while maintaining output diversity for unrelated tasks. (c) Models trained with random entropy regularization parameters. We observe no relation between the magnitude of regularization and model utility in our experiments.
  • ...and 1 more figure
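
The trade-off noted in the Figure 4 caption, where a lower temperature $\tau$ reduces leakage but also output diversity, can be seen directly in the entropy of a temperature-scaled softmax. The sketch below uses hypothetical next-token logits and helper names chosen for illustration. It shows only plain (fixed) temperature scaling, not the paper's adaptive variant or its entropy-regularized loss.

```python
import math


def softmax_with_temperature(logits, tau):
    """Next-token probabilities after dividing the logits by temperature tau."""
    if tau <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits over a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
for tau in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As $\tau$ shrinks, probability mass concentrates on the most likely token and the entropy falls, so sampling behaves more like greedy decoding. This is why lowering the temperature everywhere costs diversity, and why restricting the entropy reduction to forget-related queries is attractive.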

Theorems & Definitions (5)

  • proof
  • Proposition 1
  • proof
  • proof
  • proof