Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang, Chong Wang, Xiaojun Jia, Qing Guo, Felix Juefei-Xu, Jian Zhang, Geguang Pu, Yang Liu
TL;DR
The paper tackles universal goal hijacking by proposing POUGH, a framework that combines an efficient gradient-based optimization algorithm with two semantics-guided prompt-organization strategies. By gradually increasing the number of prompts involved in the loss and organizing prompts via diversity-focused sampling and target-oriented ranking, POUGH achieves a high attack success rate (ASR) at far lower computational cost than prior universal methods such as M-GCG. Empirical results across multiple open-source LLMs and ten malicious target types show POUGH attaining an ASR above 90% with substantial time savings, and ablations confirm the critical roles of sampling and ranking. The work highlights the practical importance of prompt organization in adversarial prompt construction and suggests directions for future improvements in semantic metrics and cross-model universality defenses.
Abstract
Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. Previous methods achieve high attack performance but are cumbersome and time-consuming. They have also concentrated solely on optimization algorithms, overlooking the crucial role of the prompts. To this end, we propose POUGH, a method that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the sequentially ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can be concatenated to arbitrary user prompts for universal goal hijacking. Experiments conducted on four popular LLMs and ten types of target responses verify its effectiveness.
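The prompt-organization stage described above (sample representative prompts, rank them, then feed a growing prefix of the ranked list to the optimizer) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the toy 2-D vectors stand in for prompt and target embeddings, cosine similarity stands in for the semantic metric, greedy farthest-point selection stands in for the sampling strategy, ranking "most similar to the target first" is a guessed ordering, and the function names are hypothetical. The suffix optimization itself is deliberately left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_diverse(embeddings, k):
    """Greedy farthest-point sampling (assumed stand-in for the sampling step).

    Returns indices of up to k candidates, each chosen to be as dissimilar
    as possible from the candidates already selected.
    """
    chosen = [0]  # assumption: seed the selection with the first candidate
    while len(chosen) < min(k, len(embeddings)):
        best_idx, best_score = None, -1.0
        for i, emb in enumerate(embeddings):
            if i in chosen:
                continue
            # Cosine distance to the nearest already-chosen candidate.
            score = min(1.0 - cosine(emb, embeddings[j]) for j in chosen)
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen


def rank_by_target(indices, embeddings, target):
    """Order sampled prompts by similarity to the target response.

    Assumption: most similar first; the source does not state the direction.
    """
    return sorted(indices, key=lambda i: cosine(embeddings[i], target), reverse=True)


def progressive_batches(ranked):
    """Yield growing prefixes of the ranked list, one more prompt per stage."""
    for n in range(1, len(ranked) + 1):
        yield ranked[:n]


# Toy data: four hypothetical prompt embeddings and one target embedding.
prompt_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
target_embedding = (0.1, 1.0)

sampled = sample_diverse(prompt_embeddings, k=3)
ranked = rank_by_target(sampled, prompt_embeddings, target_embedding)
stages = list(progressive_batches(ranked))
```

In this sketch each element of `stages` is the set of prompt indices an optimizer would include in its loss at that stage, which mirrors the idea of gradually increasing the number of prompts involved; a real pipeline would replace the toy vectors with embeddings from a sentence encoder and run the suffix optimization once per stage.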
