DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

Ben Ganon, Alon Zolfi, Omer Hofman, Inderjeet Singh, Hisashi Kojima, Yuval Elovici, Asaf Shabtai

TL;DR

DIESEL addresses the challenge of safely deploying autoregressive LLMs without expensive retraining by introducing a lightweight, inference-time mechanism that filters undesired outputs through latent-space semantic similarity to user-defined negative concepts. The method uses a three-step decoding pipeline (candidate selection, latent-space safety scoring, and token reranking) to steer generation while preserving fluency and substantially reducing unsafe responses. Evaluations across multiple state-of-the-art models, jailbreaking attacks, and multilingual settings demonstrate strong safety improvements with minimal utility loss and modest runtime overhead, outperforming several existing defenses. The approach generalizes beyond safety to other content-filtering tasks, and its reliance on textual descriptions of negative concepts enables flexible, user-friendly safety control in dynamic settings.

Abstract

In recent years, large language models (LLMs) have had great success in tasks such as casual conversation, contributing to significant advancements in domains like virtual assistance. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come at a cost, requiring either computationally expensive training or a dramatic increase in inference time. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models, even in adversarial jailbreaking scenarios that challenge response safety. We also highlight DIESEL's generalization capabilities, showing that it can be applied to use cases beyond safety, providing general-purpose response filtering.

Paper Structure

This paper contains 40 sections, 5 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: An example of a prompt with an adversarial suffix (jailbreak), with the responses of vanilla autoregressive inference and DIESEL.
  • Figure 2: Overview of DIESEL's response generation pipeline: (1) Generate next-token probabilities using the base model $f_{\theta_1}$. (2) Select the top-$k$ candidate tokens from $V_k$ based on probability. (3) Compute embeddings for each candidate token, appended to the previously generated response, using a lightweight sentence model $f_{\theta_2}$, with negative-concept embeddings precomputed. (4) Evaluate token safety scores $\gamma(\cdot)$ (the safety-score equation) and rerank using the reranking equation. (5) Choose the highest-scoring token, append it to the response, and repeat until the stop condition is met (EOS token or length limit).
  • Figure 3: Defense success rate for various defenses applied to uncensored models using the BeaverTails dataset across the five most prevalent safety categories. $\text{DIESEL}_\text{max}$ refers to DIESEL with the maximum cutoff value ($\tau=0.8$), which maintains high utility on benign prompts while halting token generation entirely for unsafe completions.
  • Figure 4: Attack success rate (ASR) across prompts in different languages on the Multilingual Aya Red-Teaming dataset. Prompts from the same language are evaluated under No Defense, DIESEL (negative concepts in the same language), and DIESEL (negative concepts in English), highlighting DIESEL’s multilingual generalizability.
  • Figure 5: Ablation study on DIESEL hyperparameters ($\alpha$, $k$, and $\tau$). We report ASR on the AutoDAN attack and average benchmark scores (MMLU, SQuAD, and TruthfulQA).
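
The decoding loop described in the Figure 2 caption can be illustrated with a minimal, self-contained sketch of a single reranking step. This is not the authors' implementation. The safety-score and reranking equations are not reproduced on this page, so the forms used here (one minus the maximum cosine similarity to any negative concept, and a convex combination of token probability and safety score weighted by $\alpha$) are assumptions. The names `toy_embed`, `TOY_AXES`, `safety_score`, and `rerank_step` are hypothetical, the toy embedder stands in for the sentence model $f_{\theta_2}$, and the cutoff $\tau$ is not modeled.

```python
import math

# Hypothetical 2-D "embedding" table standing in for the sentence model f_theta2.
TOY_AXES = {"bomb": (1.0, 0.0), "cake": (0.0, 1.0)}


def toy_embed(tokens):
    """Sum per-token toy vectors; unknown tokens contribute nothing."""
    x = sum(TOY_AXES.get(t, (0.0, 0.0))[0] for t in tokens)
    y = sum(TOY_AXES.get(t, (0.0, 0.0))[1] for t in tokens)
    return (x, y)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def safety_score(candidate_emb, negative_embs):
    """Assumed form of gamma: 1 - max similarity to any negative concept, in [0, 1]."""
    worst = max(cosine(candidate_emb, neg) for neg in negative_embs)
    return 1.0 - max(0.0, min(1.0, worst))


def rerank_step(token_probs, response, embed, negative_embs, k=3, alpha=0.5):
    """Pick the next token from the top-k candidates after safety reranking."""
    # Step 2: keep the k most probable candidate tokens.
    top_k = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scored = []
    for token, prob in top_k:
        # Step 3: embed the response so far with the candidate appended.
        emb = embed(response + [token])
        # Step 4: assumed reranking rule mixing probability and safety.
        gamma = safety_score(emb, negative_embs)
        scored.append(((1.0 - alpha) * prob + alpha * gamma, token))
    # Step 5: return the highest-scoring candidate.
    return max(scored)[1]


# Negative-concept embeddings are precomputed once, as in the caption.
negative_embs = [toy_embed(["bomb"])]
probs = {"bomb": 0.6, "cake": 0.3, "the": 0.1}
chosen = rerank_step(probs, ["bake", "a"], toy_embed, negative_embs)
```

In this toy setup, setting `alpha=0.0` reduces the step to ordinary greedy decoding, while larger values give the safety score more weight relative to the base model's probability. A full generation loop would call `rerank_step` repeatedly, appending each chosen token until an EOS token or length limit is reached.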