RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions

Yuansen Zhang, Xiao Wang, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

RoCoIns addresses the vulnerability of LLMs to textual adversarial samples by replacing natural language instructions with code-style instructions, which are more structured and less ambiguous. In few-shot settings, its adversarial context method mixes clean and adversarial in-context demonstrations, which acts as an implicit form of adversarial training during in-context learning. Experiments on AdvGLUE and Restaurant-T with GPT-3.5-series models show consistent robustness gains: lower Attack Success Rate (ASR) and higher accuracy, particularly on adversarial inputs. The results indicate that code-style instructions combined with adversarial-context demonstrations improve the robustness of black-box LLMs, and they offer practical guidance on prompt design and user-friendliness.

Abstract

Large Language Models (LLMs) have showcased remarkable capabilities in following human instructions. However, recent studies have raised concerns about the robustness of LLMs when prompted with instructions that are combined with textual adversarial samples. In this paper, drawing inspiration from recent work showing that LLMs are sensitive to the design of instructions, we replace typical natural language instructions with instructions in code style, which are more structured and less ambiguous. Through this conversion, we provide LLMs with more precise instructions and strengthen their robustness. Moreover, in few-shot scenarios, we propose a novel method that composes in-context demonstrations from both clean and adversarial samples (the adversarial context method) to further boost the robustness of LLMs. Experiments on eight robustness datasets show that our method consistently outperforms prompting LLMs with natural language instructions. For example, with gpt-3.5-turbo, our method achieves an improvement of 5.68% in test set accuracy and a reduction of 5.66 points in Attack Success Rate (ASR).
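As a rough illustration of the adversarial context method and of the ASR metric mentioned above, the following minimal sketch may help. It is not the authors' implementation: the `Demo` type, the `adv_ratio` parameter, and the sampling strategy are assumptions made for illustration, and the ASR function uses one common definition (the fraction of samples answered correctly when clean that are answered incorrectly under attack), which may differ in detail from the paper's formula.

```python
import random
from dataclasses import dataclass


@dataclass
class Demo:
    """One in-context demonstration (hypothetical container)."""
    text: str
    label: str
    adversarial: bool = False


def compose_adversarial_context(clean, adversarial, k, adv_ratio=0.5, seed=0):
    """Pick k demonstrations, about adv_ratio of them adversarial.

    adv_ratio is an assumed knob; the mixing ratio used in the paper
    is not stated in this section.
    """
    rng = random.Random(seed)
    n_adv = min(len(adversarial), round(k * adv_ratio))
    n_clean = min(len(clean), k - n_adv)
    demos = rng.sample(adversarial, n_adv) + rng.sample(clean, n_clean)
    rng.shuffle(demos)  # avoid a fixed clean/adversarial ordering
    return demos


def attack_success_rate(clean_correct, adv_correct):
    """Share of originally correct samples that the attack flips to wrong."""
    attacked = [a for c, a in zip(clean_correct, adv_correct) if c]
    if not attacked:
        return 0.0
    return sum(not a for a in attacked) / len(attacked)


clean_pool = [Demo(f"clean sentence {i}", "positive") for i in range(5)]
adv_pool = [Demo(f"perturbed sentence {i}", "negative", True) for i in range(5)]
context = compose_adversarial_context(clean_pool, adv_pool, k=4)
asr = attack_success_rate([True, True, False, True], [True, False, False, False])
```

With `k=4` and the default ratio, the context holds two clean and two adversarial demonstrations; in the ASR call, three samples are correct when clean and two of those fail under attack.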

Paper Structure (52 sections, 3 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: An illustration of prompting LLMs with natural language instructions and with code-style instructions for a semantic-consistency judgment task. The input sample contains a sentence pair. We show a clean sample and an adversarial sample, respectively. The code-style instruction can be applied to arbitrary tasks with a task-specific design.
  • Figure 2: Components of code-style instructions. (1) The class definition mainly contains the class name, annotation, initialization function, and implementation function. (2) The in-context demonstrations consist of k (adversarial) samples in the corresponding code style. (3) The task prompt follows the same format as the demonstrations but without a ground-truth label.
  • Figure 3: Perplexity on the AdvGLUE dataset for T5-base with natural language instructions and for CodeT5-base with both natural language and code-style instructions. We report the logarithm of the original values.
  • Figure 4: Accuracy with different numbers of in-context demonstrations on the SST-2 and MNLI adversarial datasets. The experiment is conducted with gpt-3.5-turbo.
  • Figure 5: Visualization of a sample's gradient on each word when fine-tuning CodeT5 with code-style instructions and T5 with natural language instructions, respectively. The sample is selected from the Restaurant-T dataset, with both its clean and adversarial versions. The task is to classify the sentiment polarity of the aspect "battery life" in the sentence as "positive" or "negative".
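
The three components named in the Figure 2 caption (class definition, in-context demonstrations, and a task prompt without a label) could be assembled roughly as in the sketch below. The figure itself is not reproduced here, so the class name `SemanticConsistency`, the `judge` method, the `assert`-based label format, and the example sentences are all hypothetical choices, not the authors' exact prompt.

```python
# Hypothetical class text sent to the model as part of the prompt.
# It is prompt content only and is never executed locally.
CLASS_DEFINITION = '''class SemanticConsistency:
    """Judge whether two sentences express the same meaning."""

    def __init__(self, sentence1: str, sentence2: str):
        self.sentence1 = sentence1
        self.sentence2 = sentence2

    def judge(self) -> str:
        # Returns "equivalent" or "not_equivalent".
        ...
'''


def render_sample(index, sentence1, sentence2, label=None):
    """Render one sample in code style; omit the label for the task prompt."""
    var = f"sample_{index}"
    header = (
        f"{var} = SemanticConsistency("
        f"sentence1={sentence1!r}, sentence2={sentence2!r})"
    )
    answer = "" if label is None else repr(label)
    return f"{header}\nassert {var}.judge() == {answer}"


def build_prompt(demonstrations, query):
    """Join class definition, k labeled demonstrations, and the open query."""
    parts = [CLASS_DEFINITION]
    for i, (s1, s2, label) in enumerate(demonstrations, start=1):
        parts.append(render_sample(i, s1, s2, label))
    parts.append(render_sample(len(demonstrations) + 1, *query))
    return "\n\n".join(parts)


demos = [
    ("The film is great.", "The movie is excellent.", "equivalent"),
    ("The fi1m is great.", "The movie is terrible.", "not_equivalent"),
]
prompt = build_prompt(demos, ("He left early.", "He departed early."))
print(prompt)
```

The prompt ends with an unfinished `assert sample_3.judge() == ` line, so the model's completion is read as the predicted label. The second demonstration contains a character-level perturbation ("fi1m") to stand in for an adversarial sample.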