Navigating the Prompt Space: Improving LLM Classification of Social Science Texts Through Prompt Engineering

Erkan Gunes, Christoffer Florczak, Tevfik Murat Yildirim

Abstract

Recent developments in text classification using Large Language Models (LLMs) in the social sciences suggest that costs can be cut significantly, while performance can sometimes rival existing computational methods. However, given the wide variance in performance across current tests, we turn to the question of how to maximize performance. In this paper, we focus on prompt context as a possible avenue for increasing accuracy by systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two classification tasks, our tests illustrate that a minimal increase in prompt context yields the largest gain in performance, while further increases in context tend to yield only marginal improvements thereafter. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch sizes, underlining the need for individual validation of each LLM coding task rather than reliance on general rules.

Paper Structure

This paper contains 7 sections, 5 figures, and 16 tables.

Figures (5)

  • Figure 1: Prompt components that vary across experimental configurations
  • Figure 2: Issue topic classification performance of GPT-4o across different contextual information configurations and input text batch sizes
  • Figure 3: Issue topic classification performance of Gemini 2.0 Flash across different contextual information configurations and input text batch sizes
  • Figure 4: Emotionality classification performance of GPT-4o across different contextual information configurations and input text batch sizes
  • Figure 5: Emotionality classification performance of Gemini 2.0 Flash across different contextual information configurations and input text batch sizes