RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions
Yuansen Zhang, Xiao Wang, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Xuanjing Huang
TL;DR
RoCoIns addresses the vulnerability of LLMs to textual adversarial prompts by replacing natural language instructions with code-style prompts, which are more structured and less ambiguous. The method includes an adversarial context approach that blends clean and adversarial in-context demonstrations, effectively performing implicit adversarial training during in-context learning. Empirical results on AdvGLUE and Restaurant-T across GPT-3.5-series models show consistent robustness gains, with notable reductions in Attack Success Rate and improved accuracy, especially under adversarial conditions. The study demonstrates that code-style instruction design, combined with adversarial-context demonstrations, enhances robustness in black-box LLMs and offers practical guidance for prompt design and user-friendliness.
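To make the idea of a code-style instruction concrete, the following minimal sketch renders one sentiment classification query both as a natural language instruction and as a code-style prompt. The function names, the task, and the exact prompt layout are hypothetical illustrations and are not taken from the paper's templates.

```python
def natural_language_prompt(sentence: str) -> str:
    """Baseline: the task phrased as a plain natural language instruction."""
    return (
        "Classify the sentiment of the sentence as positive or negative.\n"
        f"Sentence: {sentence}\n"
        "Answer:"
    )


def code_style_prompt(sentence: str) -> str:
    """The same task phrased as a code snippet for the model to complete.

    The layout below (function signature, docstring, assignment) is an
    assumed template for illustration only.
    """
    return (
        "def sentiment_analysis(sentence: str) -> str:\n"
        '    """Return "positive" or "negative" for the given sentence."""\n'
        "\n"
        # repr() quotes and escapes the input so it stays a valid string literal
        f"sentence = {sentence!r}\n"
        "label = sentiment_analysis(sentence)\n"
        "print(label)\n"
        "# Output:"
    )
```

In this sketch the task definition, the input, and the expected output slot each occupy a fixed syntactic position, which is one way to read the claim that code-style prompts are more structured and less ambiguous.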
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in following human instructions. However, recent studies have raised concerns about the robustness of LLMs when they are prompted with instructions that contain textual adversarial samples. In this paper, drawing inspiration from recent work showing that LLMs are sensitive to the design of instructions, we use code-style instructions, which are more structured and less ambiguous, in place of typical natural language instructions. This conversion gives LLMs more precise instructions and strengthens their robustness. Moreover, in few-shot scenarios, we propose a novel method that composes in-context demonstrations from both clean and adversarial samples (the adversarial context method) to further boost the robustness of LLMs. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions. For example, with gpt-3.5-turbo, our method achieves an improvement of 5.68% in test set accuracy and a reduction of 5.66 points in Attack Success Rate (ASR).
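As a rough illustration of composing in-context demonstrations from both clean and adversarial samples, the sketch below draws a fixed number of demonstrations from two pools and renders them in an assignment-style layout ahead of the query. The `Demo` class, the `adv_ratio` parameter, the sampling strategy, and the rendering format are all assumptions made for this example; the abstract does not specify how the demonstrations are selected or ordered.

```python
import random
from dataclasses import dataclass


@dataclass
class Demo:
    text: str
    label: str
    adversarial: bool  # True if the text is an adversarially perturbed sample


def build_adversarial_context(clean, adversarial, k, adv_ratio=0.5, seed=0):
    """Pick up to k demonstrations, about adv_ratio of them adversarial.

    `clean` and `adversarial` are lists of (text, label) pairs.
    The mixing ratio is a hypothetical knob, not a value from the paper.
    """
    rng = random.Random(seed)
    n_adv = min(len(adversarial), round(k * adv_ratio))
    n_clean = min(len(clean), k - n_adv)
    demos = [Demo(t, y, True) for t, y in rng.sample(adversarial, n_adv)]
    demos += [Demo(t, y, False) for t, y in rng.sample(clean, n_clean)]
    rng.shuffle(demos)  # interleave clean and adversarial demonstrations
    return demos


def render(demos, query):
    """Render demonstrations and the query as assignment-style lines."""
    lines = []
    for d in demos:
        lines.append(f"sentence = {d.text!r}")
        lines.append(f"label = {d.label!r}")
        lines.append("")
    lines.append(f"sentence = {query!r}")
    lines.append("label =")
    return "\n".join(lines)
```

Because adversarial demonstrations keep their correct labels, the model sees perturbed inputs paired with the intended output inside the prompt itself, without any parameter updates.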
