DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

Olivia Kim

TL;DR

DETAIL investigates how prompt specificity shapes reasoning performance in large language models. It introduces a prompt abstraction framework, a perplexity-based specificity metric, and a semantic equivalence evaluation to study multi-level prompts across GPT-4 and O3-mini. Experiments on 30 novel reasoning tasks spanning multiple domains show that the gains from prompt specificity depend on both task and model, and that structured prompting mitigates under-specification, particularly for smaller models. The work provides a dataset, tools, and a principled methodology to guide adaptive prompting and benchmarking of LLM reasoning.

Abstract

Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity (how detailed or vague a prompt is) remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies, and we release tools and data to support further research.
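The abstract names perplexity as the specificity measure but does not give the formula here. As a rough illustration, the sketch below uses the standard definition: the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the sample log-probabilities are assumptions made for this example, not the paper's implementation, and how a perplexity value maps to a specificity level is not stated in this section.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)). The log-probabilities are assumed
    to come from some scoring language model; which model the paper uses
    is not specified in this section.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values for illustration: four tokens, each assigned
# probability 0.25 by the scoring model.
example_logprobs = [math.log(0.25)] * 4
example_ppl = perplexity(example_logprobs)
```

With every token assigned probability 0.25, the perplexity is 4 (up to floating-point error), matching the intuition that the model is as uncertain as a uniform choice among four options.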

Paper Structure

This paper contains 27 sections, 4 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Accuracy of GPT-4 and ChatGPT O3-mini across prompt specificity levels (Level-1: vague, Level-2: moderate, Level-3: detailed) and prompting strategies.
  • Figure :