Towards LLMs Robustness to Changes in Prompt Format Styles

Lilian Ngweta, Kiran Kate, Jason Tsay, Yara Rizk

TL;DR

The paper tackles prompt brittleness in LLMs caused by non-semantic changes to prompt format styles. It introduces Mixture of Formats (MOF), a simple technique that presents each few-shot example in a distinct style format and has the model rewrite examples in different formats to discourage style–target associations. Across 16 SuperNaturalInstructions datasets and four LLMs, MOF reduces style-induced performance spread and often matches or exceeds traditional prompts in mean accuracy, improving robustness without heavy per-task optimization. The work positions MOF as an efficient, complementary prompting strategy and outlines future directions: integration with chain-of-thought (CoT) prompting and automatic prompt optimization approaches, and evaluation on larger models. The code is publicly available.

Abstract

Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats, where small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the prompt few-shot examples. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
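The core idea of diversifying few-shot styles can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the style templates in `STYLES` and the helper names `traditional_prompt` and `mof_prompt` are hypothetical, and the sketch covers only the per-example style assignment, not the step in which the model rewrites examples in another format.

```python
from itertools import cycle

# Hypothetical style templates, invented for illustration only.
# They are not the formats used in the paper's experiments.
STYLES = [
    "Question: {q}\nAnswer: {a}",
    "QUESTION:: {q} , ANSWER:: {a}",
    "question - {q}\nanswer - {a}",
]


def traditional_prompt(examples, query, style):
    """Render every few-shot example and the query in one shared style."""
    shots = [style.format(q=q, a=a) for q, a in examples]
    # The query is rendered with an empty answer slot for the model to fill.
    shots.append(style.format(q=query, a="").rstrip())
    return "\n\n".join(shots)


def mof_prompt(examples, query, styles):
    """Render each few-shot example in a different style (cycled if needed)."""
    shots = [s.format(q=q, a=a) for (q, a), s in zip(examples, cycle(styles))]
    # Assumption: the query takes the style that follows the last example's.
    query_style = styles[len(examples) % len(styles)]
    shots.append(query_style.format(q=query, a="").rstrip())
    return "\n\n".join(shots)


if __name__ == "__main__":
    demo = [("What is 2+2?", "4"), ("What is 3+3?", "6")]
    print(traditional_prompt(demo, "What is 1+1?", STYLES[0]))
    print("-----")
    print(mof_prompt(demo, "What is 1+1?", STYLES))
```

In the traditional prompt every example shares one style, so a model may tie its prediction to that style; in the mixed prompt no single style is consistently paired with the answers, which is the intuition the abstract describes.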

Paper Structure

This paper contains 17 sections, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: A demonstration of how small changes to the prompt format style can sometimes lead to incorrect predictions in LLMs.
  • Figure 2: An illustration of how to convert a traditional prompt into a MOF prompt. This example serves as a simple demonstration of the conversion process. In the actual experiments, datasets use various formats such as Passage:: {} , Answer:: {} for dataset task280, SYSTEM REFERENCE : {}. ORIGINAL REFERENCE : {}. ANSWER : {} for dataset task1186, and Tweet:{} , Label:{} , Answer:{} for dataset task905. These formats are generated using FormatSpread (Sclar et al., 2023), as described in the experiments section. The datasets used are described in the datasets table.
  • Figure 3: Comparing the performance spread of traditional prompts and MOF prompts. Spread is a metric for quantifying style-induced prompt brittleness; it is the difference between the accuracy of the best-performing prompt (maximum accuracy) and that of the worst-performing prompt (minimum accuracy). MOF prompts perform comparably to or better than traditional prompts on most datasets, although traditional prompts perform better on some.
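
The spread metric defined in the Figure 3 caption (maximum accuracy minus minimum accuracy across prompt format variants) can be computed as in the sketch below. The function name `spread` and the accuracy values are illustrative assumptions; the numbers are made up and are not results from the paper.

```python
def spread(accuracies):
    """Return max accuracy minus min accuracy across prompt format variants."""
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    return max(accuracies) - min(accuracies)


# Made-up accuracies for one dataset, one value per prompt format variant.
# These are placeholders for illustration, not figures reported in the paper.
traditional_acc = [0.62, 0.48, 0.71, 0.55]
mof_acc = [0.64, 0.60, 0.69, 0.63]

if __name__ == "__main__":
    print(f"traditional spread: {spread(traditional_acc):.2f}")
    print(f"MOF spread:         {spread(mof_acc):.2f}")
```

A smaller spread means accuracy varies less as the prompt format changes, so lower values indicate a less brittle prompt.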