Paraphrase Types Elicit Prompt Engineering Capabilities

Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp

TL;DR

The paper investigates how linguistic variations in prompts, organized into six paraphrase families (morphology, syntax, lexicon, lexico-syntax, discourse, and others), influence the behavior of large language models across 120 tasks. It conducts a large-scale, controlled empirical study over five models and 26 paraphrase types, using the Super-NaturalInstructions dataset, to quantify performance changes and to separate them from the effects of prompt length, lexical diversity, and training-data proximity. Morphology and lexicon perturbations yield notable improvements, with gains of up to 13.4% in some smaller models; the effects are task- and model-dependent, and prompt complexity metrics and data proximity largely do not account for them. The work provides practical guidance for prompt engineering and suggests that leveraging paraphrase-type variations can improve robustness to linguistic variability in real-world applications of LLMs.

Abstract

Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure behavioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic expression.

Paper Structure (13 sections, 12 equations, 12 figures, 15 tables)

Figures (12)

  • Figure 1: The potential median task performance gain (blue) over the model's baseline performance (orange) of five chat models across 120 tasks when their prompts were adjusted for specific paraphrase types (e.g., lexicon, syntax, morphology).
  • Figure 2: The main method of this paper. We paraphrase prompts of 120 tasks from 24 task families using 26 linguistic types of six categories (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others) to analyze model inputs and outputs across different factors.
  • Figure 3: The average downstream task performance gain or loss from applying specific paraphrase types to the prompt for all 120 tasks and five models.
  • Figure 4: The avg. gain or loss in performance for all 120 tasks in 24 different task families across all five models.
  • Figure 5: The distribution of how much closer the paraphrased prompt is to the closest training example in FineWeb 350BT over the original prompt (x-axis) and the distribution of task performance (y-axis). Red colors mean high mass and blue colors mean low mass between $\Delta_{train}$ and task performance.
  • ...and 7 more figures
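
The quantities named in the captions of Figures 1 and 3 can be illustrated with a minimal sketch. It assumes that the "average gain" of a paraphrase type is the mean score difference to the original prompt across tasks, and that the "potential" gain of a task is the gain of its best-performing paraphrase type. Neither definition is confirmed by the text above. The task names, scores, and function names are hypothetical and serve only as an illustration.

```python
from statistics import mean, median

# Hypothetical data: scores[task][prompt_variant] -> task metric in [0, 1].
# "original" is the unparaphrased prompt; all numbers are made up.
scores = {
    "task_a": {"original": 0.50, "morphology": 0.58, "lexicon": 0.55, "syntax": 0.47},
    "task_b": {"original": 0.70, "morphology": 0.69, "lexicon": 0.76, "syntax": 0.71},
    "task_c": {"original": 0.40, "morphology": 0.44, "lexicon": 0.38, "syntax": 0.40},
}


def average_gain_per_type(scores):
    """Mean (paraphrased - original) score per paraphrase type across tasks."""
    types = sorted({t for s in scores.values() for t in s} - {"original"})
    return {
        t: mean(s[t] - s["original"] for s in scores.values() if t in s)
        for t in types
    }


def potential_median_gain(scores):
    """Median over tasks of the best gain any paraphrase type achieves.

    Assumption: the gain is floored at zero, because keeping the original
    prompt is always an option.
    """
    best_gains = [
        max(0.0, max(v - s["original"] for k, v in s.items() if k != "original"))
        for s in scores.values()
    ]
    return median(best_gains)


if __name__ == "__main__":
    for ptype, gain in average_gain_per_type(scores).items():
        print(f"{ptype}: {gain:+.3f}")
    print(f"potential median gain: {potential_median_gain(scores):.3f}")
```

Under these assumptions, a paraphrase type can have a small or negative average gain while still contributing to the potential gain, because the potential gain picks the best type separately for each task.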