On the Robustness of Knowledge Editing for Detoxification
Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He
TL;DR
This work questions the reliability of toxicity-classifier-centric evaluations for knowledge-editing-based detoxification of LLMs. It introduces a robustness-oriented framework spanning three dimensions (optimisation, compositional, and cross-lingual) and develops degeneration-aware evaluation to detect pseudo-detoxification. Using DINM and FT-M edits on multiple LLMs, together with the multilingual mSafeEdit dataset, the study shows that detoxification is highly sensitive to hyperparameters, struggles under joint edits, and often fails to generalise across languages. The findings emphasise degeneration-aware assessment and caution against over-reliance on surface-level toxicity reductions, motivating more robust, language-inclusive detoxification strategies that produce stable behaviour change. The work provides practical guidance for evaluating safety interventions and highlights the need for methods that ensure real, transferable behavioural suppression rather than artefacts of optimisation.
Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
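To make the notion of pseudo-detoxification concrete, the following is a minimal sketch of what a degeneration-aware check could look like. It is not the paper's evaluation procedure: the function names, the distinct-bigram heuristic, the whitespace tokenisation, and every threshold are illustrative assumptions. The toxicity score is assumed to come from some external classifier and is passed in as a plain number.

```python
def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique (1.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def is_degenerate(text, min_tokens=5, min_distinct=0.5):
    """Heuristic degeneration test (hypothetical thresholds, not from the paper).

    A response counts as degenerate if it is very short or empty,
    or if most of its bigrams are repeats.
    """
    tokens = text.split()  # assumption: whitespace tokenisation
    if len(tokens) < min_tokens:
        return True
    return distinct_n(tokens) < min_distinct


def classify_outcome(text, toxicity_score, tox_threshold=0.5):
    """Combine a classifier score with the degeneration test.

    toxicity_score is assumed to be in [0, 1] and produced elsewhere.
    """
    if toxicity_score >= tox_threshold:
        return "toxic"
    if is_degenerate(text):
        # Low toxicity score, but only because the output collapsed.
        return "pseudo-detoxified"
    return "detoxified"


if __name__ == "__main__":
    print(classify_outcome("I cannot help with that request because it is unsafe.", 0.1))
    print(classify_outcome("no no no no no no no no", 0.05))
```

Under these assumptions, a low classifier score counts as successful detoxification only when the response is also non-degenerate. Whitespace tokenisation is a simplification that would not suit languages written without spaces, so a cross-lingual setting would need a language-appropriate tokeniser.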
