Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

Sampriti Soor, Suklav Ghosh, Arijit Sur

TL;DR

The paper studies the vulnerability of language models to prompt-based manipulation by proposing universal adversarial suffixes. It learns a differentiable suffix via Gumbel-Softmax relaxation, restricted by a forbid mask and trained with calibrated cross-entropy, entropy regularization, and a fluency penalty to promote transferability. Across five NLP tasks and three model families, the learned suffixes consistently degrade accuracy and calibrated confidence, and they transfer effectively to unseen models, especially in zero-shot settings. The work advances understanding of prompt-based weaknesses and offers a framework for studying and defending against universal adversarial prompts.

Abstract

Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
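
The "soft" suffix and its later discretization can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy five-word vocabulary, hand-picked logits, and a hypothetical forbid mask over label words, and it shows only the sampling and discretization steps. In actual training the per-position logits would be updated by gradient ascent on the attack objective in an autodiff framework.

```python
import math
import random

def gumbel_softmax(logits, tau, rng, forbid=None):
    """Draw one relaxed (soft) one-hot sample over the vocabulary.

    Assumes at least one token is not forbidden.
    """
    forbid = forbid or set()
    noisy = []
    for i, logit in enumerate(logits):
        if i in forbid:
            noisy.append(float("-inf"))  # forbidden tokens get zero probability
            continue
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)  # lower tau -> closer to one-hot
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

def discretize(logits, forbid=None):
    """Pick the highest-scoring allowed token for inference."""
    forbid = forbid or set()
    allowed = [i for i in range(len(logits)) if i not in forbid]
    return max(allowed, key=lambda i: logits[i])

# Hypothetical toy setup: 2 suffix positions over a 5-token vocabulary.
rng = random.Random(0)
vocab = ["good", "bad", "the", "zx", "qq"]
forbid = {0, 1}  # assumption: label words are masked out of the suffix
suffix_logits = [
    [0.1, 2.0, 0.3, 1.5, -0.2],
    [3.0, 0.0, 0.5, 0.4, 0.9],
]

soft = [gumbel_softmax(row, tau=0.5, rng=rng, forbid=forbid) for row in suffix_logits]
hard = [vocab[discretize(row, forbid)] for row in suffix_logits]
print(hard)
```

Each row of `soft` is a probability vector that could be used to mix token embeddings during training, while `hard` is the discrete suffix that would be appended at inference time. Note that the forbidden tokens carry the largest raw logits in both rows, yet they never appear in either output.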

Paper Structure

This paper contains 15 sections, 18 equations, 5 tables, and 1 algorithm.