Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation
Xinguo Feng, Zhongkui Ma, Zihan Wang, Alsharif Abuadbba, Guangdong Bai
TL;DR
The paper addresses the privacy risk that gradient inversion attacks (GIAs) pose to collaborative language-model training, where an adversary can recover private data from shared gradients. It introduces GHOST, a token-level defense that replaces original tokens with shadow tokens while preserving utility in the embedding and gradient spaces. GHOST works in two stages: it first searches for shadow candidates that are semantically distinct from the original token yet close to it in embedding space, then selects the shadows that least disrupt the model's outputs. The authors provide a formal analysis showing that the utility loss is small, $L(\mathbf{x}; \tilde{\boldsymbol{\theta}}) - L(\tilde{\mathbf{x}}; \tilde{\boldsymbol{\theta}}) = O(\epsilon)$, while gradient leakage is suppressed, with $\| \nabla_{\tilde{\boldsymbol{\theta}}} L(\mathbf{x}; \tilde{\boldsymbol{\theta}}) - \nabla_{\tilde{\boldsymbol{\theta}}} L(\tilde{\mathbf{x}}; \tilde{\boldsymbol{\theta}}) \| = O(1)$ as $\epsilon \to 0$. Empirically, GHOST achieves strong privacy protection (token-recovery rates near 1-2%) and preserves utility on classification and generation tasks (e.g., classification F1 up to 0.92; perplexity down to 5.45) across diverse models (BERT, Llama, Gemma) and datasets, and it remains resilient to adaptive GIAs. Comparisons with gradient-noise and gradient-pruning baselines show that GHOST delivers the best privacy-utility balance, underscoring the value of token-level obfuscation. The work suggests a paradigm shift from gradient-centric defenses to space-decoupled, token-level strategies for privacy in collaborative learning with large language models.
Abstract
Training and fine-tuning large-scale language models benefit greatly from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs' direct mapping from gradient space to token space. However, these methods often fall short because semantic similarity is retained across the gradient, embedding, and token spaces. In this work, we propose GHOST (gradient shield with obfuscated tokens), a novel token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across the gradient, embedding, and token spaces. GHOST is built upon an important insight: because the token space is so large, there exist semantically distinct yet embedding-proximate tokens that can serve as shadow substitutes for the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects the optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by the original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks, against state-of-the-art GIAs and adaptive attack scenarios.
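The two-step idea of searching and then selecting can be illustrated with a toy sketch. This is not the authors' implementation: the random embedding table, the `excluded` set (a hypothetical stand-in for the semantic-distinctness criteria), the single `tanh` layer standing in for the model's internal outputs, and all function names are assumptions made for illustration only.

```python
import math
import random


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search_candidates(emb, token_id, k, excluded=()):
    """Searching step (toy): return the k token ids nearest to token_id in
    embedding space, skipping the token itself and any id in `excluded`.
    `excluded` is a hypothetical placeholder for a semantic filter."""
    banned = set(excluded) | {token_id}
    ranked = sorted(
        (i for i in range(len(emb)) if i not in banned),
        key=lambda i: l2(emb[i], emb[token_id]),
    )
    return ranked[:k]


def toy_layer(weights, e):
    """Assumed stand-in for an internal model output: tanh(W e)."""
    return [math.tanh(sum(w * x for w, x in zip(row, e))) for row in weights]


def select_shadow(emb, token_id, candidates, layer):
    """Selection step (toy): pick the candidate whose layer output is
    closest to the output produced by the original token."""
    target = layer(emb[token_id])
    return min(candidates, key=lambda c: l2(layer(emb[c]), target))


# Toy data: 50 tokens with 8-dimensional embeddings and a 4x8 weight matrix.
rng = random.Random(0)
emb = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
weights = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
layer = lambda e: toy_layer(weights, e)

candidates = search_candidates(emb, token_id=3, k=5, excluded=(7,))
shadow = select_shadow(emb, 3, candidates, layer)
print("candidates:", candidates, "shadow:", shadow)
```

In this sketch the search ranks purely by embedding distance, whereas the abstract describes a multi-criteria search; a realistic version would need an actual measure of semantic distinctness and the real model's intermediate activations.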
