Computational Safety for Generative AI: A Signal Processing Perspective

Pin-Yu Chen

TL;DR

This paper reframes AI safety for GenAI through a signal-processing lens, defining computational safety as hypothesis testing with judge functions to certify safe inputs and outputs. It formalizes two concrete use cases—jailbreak prompt detection (model input) and AI-generated content detection (model output)—and demonstrates how sensitivity analysis, loss-landscape analysis, subspace modeling, and adversarial learning yield effective detectors and mitigations. Key contributions include a unified framework that recasts safety challenges as detection tasks, concrete methods like Gradient Cuff and Token Highlighter for jailbreak defense, and training-free detectors like AEROBLADE and RIGID for AI-generated image detection, plus RADAR for robust AI-generated text detection. The work argues for the essential role of signal processing in practical AI safety, highlights open challenges, and envisions pursuing AI safety as a collaborative public-good effort toward Artificial Good Intelligence with substantial real-world impact.

Abstract

AI safety is a rapidly growing area of research that seeks to prevent the harm and misuse of frontier AI technology, particularly with respect to generative AI (GenAI) tools that are capable of creating realistic and high-quality content through text prompts. Examples of such tools include large language models (LLMs) and text-to-image (T2I) diffusion models. As the performance of various leading GenAI models approaches saturation due to similar training data sources and neural network architecture designs, the development of reliable safety guardrails has become a key differentiator for responsibility and sustainability. This paper presents a formalization of the concept of computational safety, which is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI through the lens of signal processing theory and methods. In particular, we explore two exemplary categories of computational safety challenges in GenAI that can be formulated as hypothesis testing problems. For the safety of model input, we show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. For the safety of model output, we elucidate how statistical signal processing and adversarial learning can be used to detect AI-generated content. Finally, we discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
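
To make the hypothesis-testing framing concrete, here is a minimal sketch of a threshold detector, assuming each input or output has already been reduced to a scalar safety score (higher meaning more suspicious). The names `calibrate_threshold` and `rate_flagged`, the Gaussian toy scores, and the 5% false-positive target are all assumptions for illustration and do not come from the paper.

```python
import random

def calibrate_threshold(benign_scores, target_fpr=0.05):
    """Return a threshold that roughly target_fpr of benign scores exceed."""
    ordered = sorted(benign_scores)
    # Position of the empirical (1 - target_fpr) quantile, clamped to a valid index.
    k = min(len(ordered) - 1, int((1.0 - target_fpr) * len(ordered)))
    return ordered[k]

def is_flagged(score, threshold):
    """Reject the null hypothesis ("safe") when the score exceeds the threshold."""
    return score > threshold

def rate_flagged(scores, threshold):
    """Fraction of scores that the detector flags."""
    return sum(is_flagged(s, threshold) for s in scores) / len(scores)

# Toy data (hypothetical): benign scores centered at 0, unsafe scores centered at 3.
rng = random.Random(0)
benign = [rng.gauss(0.0, 1.0) for _ in range(1000)]
unsafe = [rng.gauss(3.0, 1.0) for _ in range(1000)]

tau = calibrate_threshold(benign)
false_positive_rate = rate_flagged(benign, tau)
true_positive_rate = rate_flagged(unsafe, tau)
```

In this framing, the choice of score carries the detection method, while the threshold only sets the trade-off between false alarms and missed detections.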

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the framework on signal processing for computational AI safety. The two safety challenges highlighted for model input and output safety in GenAI are the detection of unsafe queries and AI-generated content. We define computational AI safety as the set of safety problems that can be formulated as a hypothesis testing task in signal processing. We also provide examples of how signal processing techniques such as sensitivity analysis, subspace projection, and adversarial learning can be used to improve AI safety. What is unique about GenAI is that the validity of the safety hypothesis (e.g. whether the input or output is safe) requires an additional judge function for certification, which can be either a rule-based approach (e.g. keyword matching) or an AI-based evaluation (e.g. LLM-as-a-judge or external contextual classifiers).
  • Figure 2: Loss landscape analysis for benign and jailbreak prompts. We use the token embeddings of Vicuna-7B-V1.5 to compute the non-refusal rate of model responses generated from perturbed input embeddings, by interpolating two random directions with additive Gaussian noise in the token embedding space, where the perturbation strengths are denoted by $\alpha$ and $\beta$. The results are averaged over 100 prompts. The benign prompts are sampled from Chatbot Arena, and the jailbreak prompts are generated by the greedy coordinate gradient (GCG) attack [zou2023universal]. The analysis shows that the jailbreak prompts are more sensitive to Gaussian perturbations than the benign prompts.
  • Figure 3: Comparison of jailbreak prompt detection and mitigation methods. (a) Safety-capability trade-offs. The safety performance is evaluated by the attack success rate (ASR) averaged over 6 jailbreak attacks, and the capability performance is evaluated by the win rate in Alpaca Eval [alpaca_eval]. A higher win rate and a lower ASR indicate a better approach (see the "Performance Evaluation" paragraph for details). (b) Per-query run time analysis (seconds). Overall, Token Highlighter is the most economical method, best balancing the safety-capability trade-off at a lightweight computation cost.
  • Figure 4: Loss landscape analysis for real and AI-generated images. We use the embedding of DINOv2 [oquab2023dinov2] to compute the cosine similarity between an original image and its perturbed version by interpolating two random directions with additive Gaussian noise in pixel space, where the perturbation strengths are denoted by $\alpha$ and $\beta$. The results are averaged over 100 images. The real images are sampled from ImageNet, while the AI-generated ones are generated by the ablated diffusion model (ADM) [dhariwal2021diffusion]. The cosine similarity analysis shows that AI-generated images are more sensitive to Gaussian perturbations than real images.
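
The perturbation analysis described in the captions for Figures 2 and 4 can be sketched roughly as follows: draw two random Gaussian directions, shift the input by $\alpha$ times the first plus $\beta$ times the second, and record how much a response measure changes over a grid of $(\alpha, \beta)$ values. This is one reading of the captions, not the paper's code. The unit-norm scaling of the directions, the helper names, and `toy_embed` (a stand-in for a real encoder such as DINOv2) are assumptions made for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gaussian_direction(dim, rng):
    """Draw a Gaussian vector and scale it to unit length (scaling is an assumption)."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in d))
    return [a / norm for a in d]

def similarity_grid(x, embed, alphas, betas, rng):
    """Cosine similarity between embed(x) and embed(x + alpha*d1 + beta*d2)."""
    d1 = gaussian_direction(len(x), rng)
    d2 = gaussian_direction(len(x), rng)
    base = embed(x)
    grid = []
    for a in alphas:
        row = []
        for b in betas:
            perturbed = [xi + a * p + b * q for xi, p, q in zip(x, d1, d2)]
            row.append(cosine(base, embed(perturbed)))
        grid.append(row)
    return grid

def toy_embed(v):
    # Hypothetical stand-in for a real image or text encoder.
    return [math.tanh(t) for t in v]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]  # toy "input" vector
steps = [-1.0, -0.5, 0.0, 0.5, 1.0]           # perturbation strengths
grid = similarity_grid(x, toy_embed, steps, steps, rng)
```

Averaging such grids over many inputs from two populations (benign versus jailbreak prompts, or real versus AI-generated images) gives the kind of comparison the captions report. For the Figure 2 setting, the recorded quantity would be the non-refusal rate of the model's responses rather than a cosine similarity.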