Towards LLMs Robustness to Changes in Prompt Format Styles
Lilian Ngweta, Kiran Kate, Jason Tsay, Yara Rizk
TL;DR
The paper addresses prompt brittleness in LLMs: performance that shifts when the prompt's format style changes even though its meaning does not. It introduces Mixture of Formats (MOF), a simple technique that presents each few-shot example in a distinct style format and has the model rewrite the examples in different formats, which discourages the model from associating a particular style with the target. Across 16 SuperNaturalInstructions datasets and four LLMs, MOF reduces the style-induced performance spread and often matches or exceeds traditional prompts in mean accuracy, improving robustness without heavy per-task optimization. The authors position MOF as an efficient prompting strategy that complements existing methods, and they outline future directions: integration with chain-of-thought (CoT) prompting and automatic prompt optimization, and evaluation on larger models. The code is publicly available.
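The prompt construction described in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the style templates in `STYLE_FORMATS`, the wording of `REWRITE_INSTRUCTION`, and the function name `build_mof_prompt` are all hypothetical and chosen only to show each few-shot example receiving a different format.

```python
from itertools import cycle

# Hypothetical style formats; the paper's actual formats may differ.
STYLE_FORMATS = [
    "Input: {x}\nOutput: {y}",
    "INPUT:: {x} || OUTPUT:: {y}",
    "Question - {x}\nAnswer - {y}",
    "<in> {x}\n<out> {y}",
]

# Assumed wording for the rewrite step; not taken from the paper.
REWRITE_INSTRUCTION = (
    "Rewrite each example above using a different format style, "
    "then answer the final input."
)


def build_mof_prompt(task, examples, query, formats=STYLE_FORMATS):
    """Build a prompt whose few-shot examples each use a different style."""
    styles = cycle(formats)  # wrap around if there are more examples than styles
    parts = [task]
    for x, y in examples:
        # Each example is rendered with the next style in the rotation.
        parts.append(next(styles).format(x=x, y=y))
    parts.append(REWRITE_INSTRUCTION)
    # The query uses yet another style, with the answer slot left empty.
    parts.append(next(styles).format(x=query, y="").rstrip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_mof_prompt(
        "Classify the sentiment of each review.",
        [("great movie", "positive"), ("dull plot", "negative")],
        "fine acting",
    )
    print(demo)
```

A traditional few-shot prompt would render every example with the same template; the only change here is the rotation through several templates, which is why the approach needs no per-task tuning.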
Abstract
Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats: small formatting changes can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the few-shot examples of the prompt. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
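One common way to quantify performance variation across prompt formats is the spread, the gap between the best and worst score over a set of formats. The sketch below assumes that definition (maximum minus minimum accuracy); the function names and the accuracy values are illustrative only and are not results from the paper.

```python
def format_spread(accuracy_by_format):
    """Return max minus min accuracy over a set of prompt formats."""
    scores = list(accuracy_by_format.values())
    if not scores:
        raise ValueError("need at least one format to compute a spread")
    return max(scores) - min(scores)


def mean_accuracy(accuracy_by_format):
    """Return the average accuracy over a set of prompt formats."""
    scores = list(accuracy_by_format.values())
    if not scores:
        raise ValueError("need at least one format to compute a mean")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up accuracies for three hypothetical formats.
    made_up = {"format_a": 0.5, "format_b": 0.75, "format_c": 0.25}
    print(format_spread(made_up), mean_accuracy(made_up))
```

A smaller spread at a similar or higher mean accuracy is the pattern that would indicate reduced brittleness.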
