Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge

Dimitri Schreiter

TL;DR

This work probes whether increasing prompt vocabulary specificity boosts domain-specific QA and reasoning by introducing a systematic synonymization framework guided by WordNet; it evaluates noun, verb, and adjective replacements across STEM, medicine, and law using four diverse LLMs on MMLU, GPQA, and GSM8K. The study defines $S_{\text{noun/verb}}$ and $S_{\text{adjectives}}$ to quantify specificity and uses word sense disambiguation (WSD) to align synonyms with context, generating multiple prompt variants at 33%, 67%, and 100% replacement rates. Across datasets and models, higher specificity rarely yields consistent performance gains; instead, there exists an optimal specificity range where results peak, with verbs showing the strongest negative impact on reasoning tasks. The findings suggest that prompt design in specialized domains should balance specificity with generality and contextual appropriateness, rather than pursuing blanket increases in lexical specificity, and they introduce an adjective-specific measure whose practical impact requires further validation.

Abstract

Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science and mathematics), medicine, and law, remains underexplored. This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while generally increasing the specificity of prompts does not have a significant impact, there appears to be a specificity range, across all considered models, where the LLM performs the best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.
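The replacement step of the synonymization framework can be illustrated with a minimal sketch. It assumes the prompt is already part-of-speech tagged and that a synonym of the desired specificity has already been chosen for each word; the `synonymize` function, the tag names, and the rule of replacing the first candidates in order are all hypothetical choices made for illustration, not the thesis's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical input type: (token, part-of-speech tag) pairs.
Tagged = List[Tuple[str, str]]


def synonymize(tagged: Tagged, synonyms: Dict[str, str], pos: str, ratio: float) -> str:
    """Replace roughly `ratio` of the replaceable `pos` tokens with their synonyms."""
    # Positions of tokens that carry the target tag and have a synonym available.
    candidates = [
        i for i, (tok, tag) in enumerate(tagged)
        if tag == pos and tok in synonyms
    ]
    # Assumption: replace the first k candidates in order. The source does not
    # state how the words to replace are selected.
    k = round(ratio * len(candidates))
    chosen = set(candidates[:k])
    return " ".join(
        synonyms[tok] if i in chosen else tok
        for i, (tok, _) in enumerate(tagged)
    )


# Illustrative sentence and synonym choices (not taken from the datasets).
tagged = [
    ("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"),
    ("animal", "NOUN"), ("across", "ADP"), ("the", "DET"), ("field", "NOUN"),
]
synonyms = {"dog": "poodle", "animal": "mammal", "field": "meadow"}

# One variant per replacement rate named in the text: 33%, 67%, 100%.
variants = {r: synonymize(tagged, synonyms, "NOUN", r) for r in (0.33, 0.67, 1.0)}
for rate, text in variants.items():
    print(f"{rate:.0%}: {text}")
```

With three replaceable nouns, the three rates replace one, two, and all three nouns respectively, which is the kind of graded variant set the evaluation compares.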

Paper Structure

This paper contains 15 sections, 5 equations, 27 figures, 12 tables.

Figures (27)

  • Figure 1: Specificity-based Synonymization Framework. Representation of the specificity-based synonymization framework used to synonymize the prompt instructions of all datasets with varying specificities. The preprocessing comprises five key steps: (I) retrieving the parts of speech from the original instruction, (II) crawling synonyms, (III) calculating the specificity scores for all parts of speech (the green boxes include WSD in the algorithm), (IV) categorizing the synonyms into low, intermediate, and high specificity, and (V) synonymizing the original instructions with three different replacement ratios (33%, 67%, 100%). The figure also includes a step-by-step example sampled from the GPQA dataset, in which nouns are replaced with synonyms of varying specificity.
  • Figure 2: Confusion Matrix of WSD Model Evaluation. The agreement of Llama-3.1-70B-Instruct and fine-tuned T5 with the human-evaluated ground truth when predicting word senses for WSD.
  • Figure 3: Spearman-Correlations for Adjective Ranking. The histogram displays the distribution of Spearman correlations from the LLM-as-a-judge experiment, which compares the model's qualitative ranking of adjective specificity with the calculated specificity score measure for 91 samples. The median Spearman correlation in this distribution is 0.50.
  • Figure 4: Example for Prompt Specificity Calculation. This example schematically illustrates the calculation of the prompt specificity, by aggregating the specificities of one part of speech (nouns in this case) and calculating the average that we call prompt specificity. Additionally, it shows the prompt specificity change from 19.36 to 20.60 after we substitute the noun dog with the more specific synonym poodle.
  • Figure 5: Specificity Score Distribution for nouns, verbs and adjectives. The histograms show the distribution of specificity scores for the respective part of speech. The mean specificity score is $\mu_{\text{nouns}} = 21.37$ for nouns, $\mu_{\text{verbs}} = 12.09$ for verbs, and $\mu_{\text{adjectives}} = 0.40$ for adjectives.
  • ...and 22 more figures
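
The prompt specificity described in the Figure 4 caption, an average over the specificity scores of one part of speech, can be sketched as follows. The `prompt_specificity` helper and all word scores below are hypothetical placeholders chosen for illustration; they are not the values behind the 19.36 and 20.60 reported in the caption.

```python
from statistics import mean
from typing import Dict, List


def prompt_specificity(words: List[str], scores: Dict[str, float]) -> float:
    """Average the specificity scores of the scored words in a prompt."""
    scored = [scores[w] for w in words if w in scores]
    if not scored:
        raise ValueError("prompt contains no words with a specificity score")
    return mean(scored)


# Hypothetical noun specificity scores (placeholders, not from the paper).
scores = {"dog": 14.0, "poodle": 18.0, "park": 20.0, "owner": 22.0}

before = prompt_specificity(["dog", "park", "owner"], scores)
after = prompt_specificity(["poodle", "park", "owner"], scores)

print(f"before: {before:.2f}, after: {after:.2f}")
```

Swapping a single noun for a more specific synonym raises the average, which mirrors the direction of the change the caption describes for replacing dog with poodle.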