MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary

TL;DR

MANATEE reframes safety for LLMs as a density-estimation problem over the benign hidden-state manifold and learns a score-based diffusion model trained on benign representations. At inference, it detects anomalous hidden states via an anomaly score and either refines them toward safe regions through DDPM-based diffusion or refuses execution, all without harmful training data or model edits. The approach yields substantial attack-success-rate reductions across multiple models and jailbreak datasets (e.g., up to 100% on ASA and around 78% average), while preserving utility on benign inputs. Its plug-in, inference-time design provides a lightweight, model-agnostic safety mechanism with cross-model transferability and minimal impact on normal operation.

Abstract

Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial inputs fall outside the learned decision boundary, and repeated fine-tuning is computationally expensive and can degrade model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets while preserving model utility on benign inputs.

Paper Structure

This paper contains 25 sections, 3 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Attack Success Rate (ASR) reduction across models on the JBB dataset, before and after the diffusion model is applied to fine-tuned harmful models.
  • Figure 2: Overview of MANATEE’s inference-time detection and steering pipeline. Given a user query, the base LLM produces the final-layer hidden state $h_t$. We standardize this representation and apply forward diffusion to obtain a noisy state at a timestep $t_{\mathrm{check}}$. The denoiser's predicted noise magnitude defines an anomaly score $s(h)$ that measures how far $h_t$ lies from the benign manifold. States with scores above the threshold $\tau$ trigger automatic refusal; otherwise, MANATEE applies a DDPM-based purification step by injecting noise and denoising in latent space to produce a purified hidden state that is passed through the LM head to generate a benign, conditionally steered response.
  • Figure 3: Anomaly score distribution for benign vs. backdoored hidden states.
  • Figure 4: Diffusion model training loss over epochs.
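
The detect-then-refuse-or-purify flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the timesteps `t_check=100` and `t_purify=50`, the threshold `tau=3.0`, and all function names are hypothetical. The trained denoiser is replaced by `toy_denoiser`, the closed-form optimal noise predictor for data that is exactly standard normal, and the purification step uses a single-shot estimate of the clean state rather than a full DDPM reverse chain.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear DDPM noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_denoiser(x_t, alpha_bar_t):
    """Stand-in for the trained noise predictor (an assumption, not the paper's network).

    If standardized benign states were exactly N(0, I), the optimal noise
    prediction would be E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t.
    """
    return np.sqrt(1.0 - alpha_bar_t) * x_t


def noised(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def anomaly_score(h, mean, std, alpha_bar, t_check, rng, denoiser=toy_denoiser):
    """Norm of the predicted noise for the standardized, noised hidden state."""
    x0 = (h - mean) / std
    a = alpha_bar[t_check]
    x_t = noised(x0, a, rng)
    return float(np.linalg.norm(denoiser(x_t, a)))


def purify(h, mean, std, alpha_bar, t_purify, rng, denoiser=toy_denoiser):
    """Inject noise, then form a one-shot estimate of the clean state."""
    x0 = (h - mean) / std
    a = alpha_bar[t_purify]
    x_t = noised(x0, a, rng)
    x0_hat = (x_t - np.sqrt(1.0 - a) * denoiser(x_t, a)) / np.sqrt(a)
    return x0_hat * std + mean  # undo the standardization


def defend(h, mean, std, alpha_bar, tau, t_check=100, t_purify=50, rng=None):
    """Refuse when the score exceeds tau; otherwise return a purified state."""
    rng = np.random.default_rng(0) if rng is None else rng
    score = anomaly_score(h, mean, std, alpha_bar, t_check, rng)
    if score > tau:
        return "refuse", None, score
    return "purify", purify(h, mean, std, alpha_bar, t_purify, rng), score


# Toy usage: a state at the benign mean versus one shifted 10 std in every dimension.
alpha_bar = make_alpha_bar()
mean, std = np.zeros(16), np.ones(16)
print(defend(mean.copy(), mean, std, alpha_bar, tau=3.0)[0])
print(defend(mean + 10.0 * std, mean, std, alpha_bar, tau=3.0)[0])
```

In a real deployment `mean` and `std` would presumably be estimated from benign hidden states, and the purified vector would be passed through the LM head as the caption describes. Those steps are omitted here because they depend on the specific model.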