Prompt Refinement or Fine-tuning? Best Practices for using LLMs in Computational Social Science Tasks

Anders Giovanni Møller, Luca Maria Aiello

TL;DR

The paper evaluates six LLM-based CSS classification approaches (zero-shot, AI-knowledge prompting, RAG, fine-tuning, instruction tuning, reverse instruction tuning) on a SOCKET-based benchmark of 23 tasks using two Llama models. It finds that larger pre-training corpora and vocabularies benefit performance, AI-enhanced prompts outperform plain zero-shot, and fine-tuning yields robust gains, while advanced instruction tuning can degrade results for less expressive base models. The study provides practical guidelines for practitioners: prioritize strong pre-training, employ task-informed prompts, and apply fine-tuning when resources permit, with caution for cross-task instruction strategies when data is scarce.

Abstract

Large Language Models are expressive tools that enable complex tasks of text understanding within Computational Social Science. Their versatility, while beneficial, poses a barrier for establishing standardized best practices within the field. To bring clarity on the value of different strategies, we present an overview of the performance of modern LLM-based classification methods on a benchmark of 23 social knowledge tasks. Our results point to three best practices: select models with larger vocabulary and pre-training corpora; avoid simple zero-shot in favor of AI-enhanced prompting; and fine-tune on task-specific data, considering more complex forms of instruction-tuning on multiple datasets only when training data is more abundant.

Paper Structure (11 sections, 1 table)