Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models

Shreyas Meher

Abstract

Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.

Paper Structure (42 sections, 3 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: AUC performance of ConfliBERT and Confli-mBERT across attack types, plotted against class prevalence on a logarithmic scale. The two models converge in performance as class size increases, but ConfliBERT maintains a strict advantage.
  • Figure 2: The AUC difference between ConfliBERT and Confli-mBERT, plotted against class prevalence (log scale). The dashed red line shows the logarithmic trend. The performance gap narrows toward zero for common attack types.
  • Figure 3: Overall classification accuracy across nine models evaluated on a stratified 2,000-event sample from the GTD test set. The three fine-tuned models (dark bars) form a distinct performance tier above the commercial APIs (medium bars), which in turn outperform the locally deployed open-source models used without any fine-tuning (light bars). The gap between the weakest fine-tuned model (ConflLlama, 72.9%) and the strongest commercial API (Gemini Flash, 65.9%) is roughly seven percentage points.
  • Figure 4: Model size (total parameters, log scale) plotted against Micro F1 classification performance. Fine-tuned models ($\bigstar$) cluster in the upper left: small models with high performance. Commercial APIs ($\blacklozenge$) and open-source models ($\bullet$) occupy the lower right: large models with mediocre to poor performance. ConfliBERT, at 110 million parameters, outperforms DeepSeek V3.2, a model roughly 6,200 times its size. The relationship between size and performance is negative across model families, demonstrating that task-specific fine-tuning overwhelmingly dominates model scale for classification tasks.
  • Figure 5: F1 scores for all nine models across all nine attack types. Green cells indicate strong performance ($\geq$0.70); yellow indicates moderate performance (0.40--0.69); red indicates poor performance ($<$0.40). The leftmost three columns (fine-tuned models) are predominantly green across most attack types. The rightmost six columns (zero-shot models) show high variance: strong performance on Bombing/Explosion but near-zero performance on Unknown and Barricade Incident.