Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge
Dimitri Schreiter
TL;DR
This work probes whether increasing prompt vocabulary specificity boosts domain-specific QA and reasoning by introducing a systematic synonymization framework guided by WordNet; it evaluates noun, verb, and adjective replacements across STEM, medicine, and law using four diverse LLMs on MMLU, GPQA, and GSM8K. The study defines $S_{\text{noun/verb}}$ and $S_{\text{adjectives}}$ to quantify specificity and uses word sense disambiguation (WSD) to align synonyms with context, generating multiple prompt variants at 33%, 67%, and 100% replacement rates. Across datasets and models, higher specificity rarely yields consistent performance gains; instead, there exists an optimal specificity range where results peak, with verbs showing the strongest negative impact on reasoning tasks. The findings suggest that prompt design in specialized domains should balance specificity with generality and contextual appropriateness, rather than pursuing blanket increases in lexical specificity. The work also introduces an adjective-specific measure whose practical impact requires further validation.
Abstract
Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science, and mathematics), medicine, and law, remains underexplored. This thesis investigates whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while increasing the specificity of prompts does not, in general, have a significant impact on performance, there appears to be a specificity range, shared across all considered models, in which the LLMs perform best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.
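To make the idea of substituting words at a controlled replacement rate concrete, the following is a minimal sketch, not the framework used in the thesis. It assumes a hand-written toy lexicon (`TOY_LEXICON`) in place of WordNet lookups and word sense disambiguation, and the function name `synonymize`, its parameters, and the whitespace tokenization are all hypothetical choices made for illustration.

```python
import random

# Hypothetical toy lexicon: general word -> more specific substitute.
# This hand-written mapping only stands in for a WordNet-based lookup
# with word sense disambiguation; it is an assumption of this sketch.
TOY_LEXICON = {
    "animal": "mammal",  # noun
    "moves": "runs",     # verb
    "large": "huge",     # adjective
}


def synonymize(prompt, rate, lexicon, seed=0):
    """Replace roughly `rate` (0.0 to 1.0) of the replaceable words in `prompt`.

    Tokenization is a plain whitespace split, so punctuation attached to a
    word prevents a match; a real pipeline would need a proper tokenizer.
    """
    tokens = prompt.split()
    # Positions whose token has an entry in the lexicon.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    # Number of substitutions for this rate, rounded to the nearest integer.
    k = round(rate * len(candidates))
    # Seeded generator so the same prompt and rate give the same variant.
    rng = random.Random(seed)
    chosen = set(rng.sample(candidates, k))
    return " ".join(
        lexicon[tok.lower()] if i in chosen else tok
        for i, tok in enumerate(tokens)
    )


if __name__ == "__main__":
    base = "A large animal moves"
    for rate in (0.33, 0.67, 1.0):
        print(rate, "->", synonymize(base, rate, TOY_LEXICON))
```

With three replaceable words, the rates 0.33, 0.67, and 1.0 select one, two, and three substitutions respectively, so the full-replacement variant of the example prompt is "A huge mammal runs". Which words are picked at the lower rates depends on the seed.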
