BeamClean: Language Aware Embedding Reconstruction

Kaan Kale, Kyle Mylonakis, Jay Roberts, Sidhartha Roy

TL;DR

BeamClean addresses the privacy risk of obfuscated input embeddings in model-as-a-service (MaaS) settings by formulating a blind inversion attack: the adversary observes only the obfuscated embeddings $y_{1:T}$ and the embedding table $\mathcal{X}$, with no access to the obfuscation mechanism or the model. BeamClean jointly estimates the noise-model parameters $\theta$ and decodes the original token sequence with a beam-search procedure that incorporates a language-model prior $p_{LM}$, solving $\hat{\theta}=\arg\max_{\theta}\log p(\theta \mid y_{1:T})$ and retrieving $\hat{w}_{1:T}=\arg\max_{w_{1:T}}\left[\log\pi_{\hat{\theta}}(y_{1:T} \mid x(w_{1:T}))+\log p_{LM}(w_{1:T})\right]$. Empirically, BeamClean outperforms naive nearest-neighbor baselines under Gaussian and Laplacian noise, with substantial gains in token recovery and PII leakage (e.g., $74.3\%$ vs. $42.1\%$ token recovery on MRPC with Gaussian noise at $\epsilon=15$, and $60\%$ vs. $1.9\%$ PII recovery on PAPILLON with Laplacian noise at $\epsilon=8.5$), and it remains effective even when the decoding prior differs from the target LM. These results underscore the vulnerability of input-independent noise mechanisms and motivate stronger, input-aware privacy defenses for embedding-based inference in MaaS environments.
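The decoding objective above can be illustrated with a small beam search. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the surrogate noise model factorizes over tokens as an isotropic Gaussian, $\log\pi_{\theta}(y_{1:T} \mid x(w_{1:T}))=\sum_t \log\mathcal{N}(y_t;\, x(w_t),\, \sigma^2 I)$ with $\theta=\sigma$, and it replaces $p_{LM}$ with a hypothetical bigram table. The names `beam_decode`, `log_bigram`, and `log_start` are illustrative only.

```python
import numpy as np

def gaussian_loglik(y_t, table, sigma):
    """log N(y_t; x(w), sigma^2 I) for every vocabulary entry w (one row of `table` each)."""
    d = table.shape[1]
    sq = np.sum((table - y_t) ** 2, axis=1)
    return -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)

def beam_decode(y, table, log_bigram, log_start, sigma, beam_size=4, top_k=3):
    """Approximate argmax_w sum_t [log pi(y_t | x(w_t)) + log p_LM(w_t | w_{t-1})].

    Assumption: a bigram table stands in for the language-model prior p_LM.
    """
    beams = [((), 0.0)]  # (token ids so far, cumulative log score)
    for y_t in y:
        noise_ll = gaussian_loglik(y_t, table, sigma)
        cand = np.argsort(noise_ll)[::-1][:top_k]  # top-k token ids by noise likelihood
        extended = []
        for seq, score in beams:
            for w in cand:
                prior = log_start[w] if not seq else log_bigram[seq[-1], w]
                extended.append((seq + (int(w),), score + noise_ll[w] + prior))
        extended.sort(key=lambda b: b[1], reverse=True)
        beams = extended[:beam_size]  # keep only the best partial sequences
    return list(beams[0][0])

# Toy setup: 4 tokens with 2-D embeddings and a prior that favors 0 -> 1 -> 2.
table = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
log_start = np.log(np.array([0.7, 0.1, 0.1, 0.1]))
log_bigram = np.log(np.array([
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]))
y = np.array([[0.1, 0.0], [0.45, 0.0], [0.1, 0.9]])  # hand-picked "noisy" embeddings

print(beam_decode(y, table, log_bigram, log_start, sigma=0.5))  # [0, 1, 2]
nn = [int(np.argmin(np.sum((table - y_t) ** 2, axis=1))) for y_t in y]
print(nn)  # [0, 0, 2]
```

In this toy case, nearest-neighbor decoding picks token 0 at the second position because the noisy vector lies slightly closer to it, while the prior term pulls the beam toward the sequence [0, 1, 2]. Restricting each step to the top-$k$ tokens by noise likelihood keeps the search over extensions small.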

Abstract

In this work, we consider an inversion attack on obfuscated input embeddings sent to a server-hosted language model, where the adversary has no access to the language model or the obfuscation mechanism and sees only the obfuscated embeddings and the model's embedding table. We propose BeamClean, an inversion attack that jointly estimates the noise parameters and decodes token sequences by integrating a language-model prior. Against Laplacian and Gaussian obfuscation mechanisms, BeamClean surpasses naive distance-based attacks in every setting evaluated. This work highlights the need for more robust, learned, input-dependent obfuscation methods.
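For reference, input-independent noise mechanisms and the naive distance-based baseline can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's experimental setup: the random embedding table, the noise scales, and the function names `obfuscate`, `nearest_neighbor_attack`, and `token_recovery_rate` are hypothetical, and the mapping from a privacy budget $\epsilon$ to a noise scale is deliberately left out.

```python
import numpy as np

def obfuscate(x, mechanism, scale, rng):
    """Add input-independent i.i.d. noise to every coordinate of every embedding.

    `scale` is a free parameter here; how it is calibrated from a DP epsilon
    is not specified in this summary and is omitted from the sketch.
    """
    if mechanism == "gaussian":
        return x + rng.normal(0.0, scale, size=x.shape)
    if mechanism == "laplace":
        return x + rng.laplace(0.0, scale, size=x.shape)
    raise ValueError(f"unknown mechanism: {mechanism}")

def nearest_neighbor_attack(y, table):
    """Naive baseline: map each noisy embedding to the closest table row (Euclidean)."""
    # ||y - x||^2 = ||y||^2 - 2 y.x + ||x||^2 for every (position, token) pair
    d2 = (y**2).sum(axis=1)[:, None] - 2.0 * y @ table.T + (table**2).sum(axis=1)[None, :]
    return d2.argmin(axis=1)

def token_recovery_rate(pred, true):
    """Fraction of positions whose recovered token id equals the original one."""
    return float(np.mean(np.asarray(pred) == np.asarray(true)))

rng = np.random.default_rng(0)
table = rng.normal(size=(500, 32))       # hypothetical embedding table (V=500, d=32)
tokens = rng.integers(0, 500, size=100)  # hypothetical plaintext token ids
for mech in ("gaussian", "laplace"):
    for scale in (0.1, 1.0, 3.0):
        y = obfuscate(table[tokens], mech, scale, rng)
        rate = token_recovery_rate(nearest_neighbor_attack(y, table), tokens)
        print(f"{mech:8s} scale={scale:.1f} recovery={rate:.2f}")
```

As the noise scale grows, the nearest table row to a noisy vector is increasingly a different token; this is the regime in which a language-model prior can help disambiguate candidates that distance alone cannot separate.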

Paper Structure

This paper contains 23 sections, 14 equations, 5 figures, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of the generic input-embedding obfuscation pipeline and our adversarial threat model. Plaintext inputs are first encoded and transformed into noisy (i.e., obfuscated) embeddings, which are then transmitted to the LLM provider. An attacker who accesses the noisy embeddings attempts to recover the original plaintext input. Within the local trust zone, an obfuscation mechanism is applied to the embeddings for a target LLM. These noisy embeddings are the inputs to the BeamClean algorithm, which passes them through a scoring algorithm to determine the top-$k$ candidate token IDs. These candidates are combined with the candidates from previously scored tokens to form beams. The top-scoring beams are retained and used to start the scoring for the next token in the sequence.
  • Figure 2: BeamClean is an iterative algorithm that begins with clean candidate tokens mapped to their corresponding embeddings. These clean candidate embeddings and the noisy embeddings are inputs to a surrogate noise model of the obfuscation mechanism, $\pi_{\hat{\theta}}$. The clean candidate tokens are also used to produce a language prior (optionally translating tokens when the target and prior language models are distinct). Together, the language prior and the surrogate noise model produce a likelihood score, which is used to train the surrogate model. This process repeats to update the beam candidates. Finally, the highest-scoring beam is selected as the reconstruction.
  • Figure 3: Performance of BeamClean compared to Nearest Neighbor on the MRPC dataset. Curves show the token-recovery rate as a function of $\epsilon$ with a beam size of 20, against the Gaussian and Laplacian noise mechanisms (one panel each). In both cases BeamClean outperforms Nearest Neighbor. Against Gaussian noise at $\epsilon=15$, our attack recovers 74.3% of tokens versus 42.1% for Nearest Neighbor. Against Laplacian noise at $\epsilon=8.5$, it attains 86% recovery versus 18% for Nearest Neighbor.
  • Figure 4: Mean PII recovery (%) on PAPILLON versus the DP-$\epsilon$ of the Laplace noise mechanism. BeamClean consistently recovers more PII strings than Nearest Neighbor, recovering 60.0% of PII strings compared to 1.9% for Nearest Neighbor at $\epsilon=8.5$.
  • Figure 5: Attacking obfuscated GPT-2 embeddings using a Llama-3.2-1B-Instruct model as the language prior. BeamClean uniformly outperforms Nearest Neighbor. The largest measured gap between the two reconstruction methods occurs at DP $\epsilon \approx 9.6$, where BeamClean achieves 77% recovery versus 17% for the baseline.
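The surrogate noise model $\pi_{\hat{\theta}}$ in Figure 2 can be illustrated with closed-form fits. This is a simplified stand-in, not the paper's training procedure: it assumes the surrogate is either an isotropic Gaussian or a zero-mean, per-coordinate Laplace distribution, and that the current reconstruction $x(\hat{w}_{1:T})$ is held fixed, in which case the maximum-likelihood scales have closed forms. The function names are hypothetical.

```python
import numpy as np

def fit_gaussian_sigma(y, x_hat):
    """MLE of an isotropic Gaussian std given a fixed reconstruction: sqrt(mean r^2)."""
    return float(np.sqrt(np.mean((y - x_hat) ** 2)))

def fit_laplace_scale(y, x_hat):
    """MLE of a zero-mean, per-coordinate Laplace scale: the mean absolute residual."""
    return float(np.mean(np.abs(y - x_hat)))

def surrogate_loglik(y, x_hat, sigma):
    """Gaussian surrogate log pi_theta(y_{1:T} | x_hat), summed over tokens and coordinates."""
    r = y - x_hat
    return float(-0.5 * np.sum(r**2) / sigma**2 - r.size * np.log(sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)
table = rng.normal(size=(200, 16))       # hypothetical embedding table
tokens = rng.integers(0, 200, size=400)  # hypothetical (correctly decoded) tokens
x = table[tokens]
y_g = x + rng.normal(0.0, 0.5, size=x.shape)   # Gaussian noise, true std 0.5
y_l = x + rng.laplace(0.0, 0.5, size=x.shape)  # Laplace noise, true scale 0.5

s = fit_gaussian_sigma(y_g, x)
print(s)                           # should be near the true std 0.5
print(fit_laplace_scale(y_l, x))   # should be near the true scale 0.5
# The fitted sigma maximizes the Gaussian surrogate likelihood for these residuals.
print(surrogate_loglik(y_g, x, s) >= surrogate_loglik(y_g, x, 1.2 * s))  # True
```

When the reconstruction contains wrong tokens, the residuals $y_t - x(\hat{w}_t)$ grow and the fitted scale is inflated, so such a fit is only as good as the reconstruction it is conditioned on.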