Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization for Prompt Enhancement

Pengwei Zhan, Zhen Xu, Qian Tan, Jie Song, Ru Xie

TL;DR

The paper shows that large language models exhibit pronounced sensitivity to lexical variations in prompts, even when changes are nearly imperceptible to humans. It introduces COPLE, a black-box combinatorial optimization framework that iteratively substitutes semantically similar words in the task description based on feedback from proxy tasks to maximize downstream performance. Across GLUE and MMLU benchmarks and multiple models, COPLE substantially improves results relative to human-crafted prompts and other prompting baselines, demonstrating that lexical optimization can recover instruction-following and task-solving abilities. The work highlights the importance of evaluating and optimizing the exact wording of prompts prior to more complex prompt engineering, with implications for robustness and reproducibility of LLM-based systems.

Abstract

Large language models (LLMs) demonstrate exceptional instruction-following ability to complete various downstream tasks. Although this impressive ability makes LLMs flexible task solvers, their performance in solving tasks also relies heavily on instructions. In this paper, we reveal that LLMs are over-sensitive to lexical variations in task instructions, even when the variations are imperceptible to humans. When models are given neighborhood instructions, which are closely situated in the latent representation space and differ by only one semantically similar word, their performance on downstream tasks can be vastly different. Following this property, we propose a black-box Combinatorial Optimization framework for Prompt Lexical Enhancement (COPLE). COPLE performs iterative lexical optimization according to the feedback from a batch of proxy tasks, using a search strategy related to word influence. Experiments show that even widely used human-crafted prompts for current benchmarks suffer from the lexical sensitivity of models, and COPLE recovers the declined model ability in both instruction following and solving downstream tasks.
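
The iterative search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `word_influence`, `lexical_search`, `toy_score`, and the candidate table are hypothetical, measuring influence by deleting a word is an assumption, and `toy_score` is a deterministic stand-in for what would in practice be the accuracy of a black-box LLM on a batch of proxy tasks.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[List[str]], float]


def word_influence(words: List[str], score: Scorer) -> List[float]:
    """Assumed influence measure: how much the score moves when a word is deleted."""
    base = score(words)
    return [abs(base - score(words[:i] + words[i + 1:])) for i in range(len(words))]


def lexical_search(prompt: str,
                   candidates: Dict[str, List[str]],
                   score: Scorer) -> Tuple[str, float]:
    """Greedy substitution: visit words from most to least influential and
    keep a candidate only if it improves the proxy score."""
    words = prompt.split()
    best = score(words)
    influence = word_influence(words, score)
    order = sorted(range(len(words)), key=lambda i: influence[i], reverse=True)
    for i in order:
        # Candidates are looked up once for the word originally at position i.
        for cand in candidates.get(words[i], []):
            trial = words[:i] + [cand] + words[i + 1:]
            trial_score = score(trial)
            if trial_score > best:
                words, best = trial, trial_score
    return " ".join(words), best


# Hypothetical stand-in for proxy-task feedback; a real scorer would query the model.
PREFERRED = {"since", "repeat", "theme"}
DISCOURAGED = {"whether", "have", "meaning"}


def toy_score(words: List[str]) -> float:
    good = sum(w in PREFERRED for w in words)
    bad = sum(w in DISCOURAGED for w in words)
    return (5 + good - bad) / 10


original = "Please identify whether the sentences have the same meaning"
# Hypothetical search space of semantically similar words.
candidates = {
    "whether": ["if", "since"],
    "have": ["share", "repeat"],
    "meaning": ["sense", "theme"],
}

optimized, best = lexical_search(original, candidates, toy_score)
print(optimized, best)
```

Because the scorer is passed in as a plain function, the search itself never looks inside the model, which is what makes the approach black-box: swapping `toy_score` for a function that runs the prompt on sampled proxy examples would leave the rest unchanged.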

Paper Structure (34 sections, 9 equations, 6 figures, 19 tables)

Figures (6)

  • Figure 1: Prompt lexical enhancement from a combinatorial optimization perspective. Initially, we provide the prompt "Please identify whether the sentences have the same meaning" for Llama-2-7B-chat to complete the tasks from Quora Question Pairs2 (QQP), and combine the validation set of QQP with the prompt as a predefined task pool, with each example being an individual task. By iteratively substituting the most influential words in the prompt with semantically similar words picked from the potential search space, we find the optimal prompt "Please identify since the sentences repeat the same theme" that increases the accuracy from 35% to 57%. The details of the operations can be found in the section on COPLE.
  • Figure 2: Visualization of model performance on the CoLA and MMLU-STEM validation sets with neighborhood prompts. The task description of the original prompt picked for CoLA is "Does this sentence make sense?", and for MMLU-STEM is "The following are multiple choice questions (with answers) about {task}", where {task} is a placeholder to be replaced with the specific subset type, e.g., "abstract algebra". A point ● in a lighter color indicates better performance, the square ■ indicates the original prompt, and the ▶ in the color bar indicates the original performance. The prompts shown above the plots mark the changed words and their substitutions.
  • Figure 3: Impact of the number of words changed in prompt on downstream performance.
  • Figure 4: Impact of the number of sampled examples in proxy reference tasks on downstream performance.
  • Figure 5: Impact of the number of candidate words in search space on downstream performance.
  • ...and 1 more figure