On the Robustness of Knowledge Editing for Detoxification

Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He

TL;DR

This work questions the reliability of toxicity-classifier-centric evaluations for knowledge-editing-based detoxification of LLMs. It introduces a robustness-oriented framework spanning three dimensions (optimisation, compositional, and cross-lingual) and develops degeneration-aware evaluation to detect pseudo-detoxification. Using DINM and FT-M edits applied to multiple LLMs and the multilingual mSafeEdit dataset, the study shows that detoxification is highly sensitive to hyperparameters, struggles under joint edits, and often fails to generalise across languages. The findings emphasise degeneration-aware assessment and caution against over-reliance on surface-level toxicity reductions, motivating more robust, language-inclusive detoxification strategies that produce stable behaviour change. The work provides practical guidance for evaluating safety interventions and highlights the need for methods that ensure real, transferable behavioural suppression rather than artefacts of optimisation.

Abstract

Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
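
The idea of degeneration-aware evaluation can be illustrated with a minimal sketch. It is not the authors' implementation: the n-gram repetition heuristic, the 0.5 threshold, the function names, and the `is_unsafe` callback (a stand-in for any toxicity classifier) are all assumptions made for illustration. The point is only that a response judged "not unsafe" by a classifier is not counted as a success if it is empty or heavily repetitive.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (assumed heuristic)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def is_degenerate(text: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Treat empty or heavily repetitive output as degenerate (threshold is hypothetical)."""
    return not text.strip() or repetition_ratio(text, n) >= threshold


def degeneration_aware_counts(
    responses: Iterable[str],
    is_unsafe: Callable[[str], bool],
) -> Dict[str, int]:
    """Split responses into degenerate, unsafe, and ok.

    Degeneration is checked first, so a repetitive output is never
    credited as a successful detoxification just because the
    classifier did not flag it.
    """
    counts = {"unsafe": 0, "degenerate": 0, "ok": 0}
    for response in responses:
        if is_degenerate(response):
            counts["degenerate"] += 1
        elif is_unsafe(response):
            counts["unsafe"] += 1
        else:
            counts["ok"] += 1
    return counts
```

Under this sketch, a classifier-only metric would report every non-flagged response as safe, whereas the combined count treats both `unsafe` and `degenerate` as failures, which mirrors the way the figure captions report unsafe responses alongside repetitions.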

Paper Structure (37 sections, 24 figures, 3 tables)

Figures (24)

  • Figure 1: Number of unsafe responses and repetitions among 50 test items for LLMs detoxified using DINM. Detoxification is performed with a learning rate of $5\times10^{-4}$ and 10 editing steps.
  • Figure 2: Results for Mistral-7B edited using DINM: the number of unsafe responses with respect to different editing steps (left); the number of repetitions with respect to different editing steps (middle); the number of unsafe/repetitive responses with respect to different learning rates at the best editing steps (right).
  • Figure 3: Number of failures (i.e., unsafe responses and repetitive generations) after editing with increasing numbers of unsafe behaviours. 'lr' denotes the learning rate.
  • Figure 4: Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Qwen2-7B.
  • Figure 5: Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Qwen2-7B, where editing is performed in English and evaluation is conducted in other languages.
  • ...and 19 more figures