Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models

Shreyas Meher

Abstract

Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.

Paper Structure (42 sections, 3 equations, 5 figures, 13 tables)

Introduction
Related Work
The Build-Borrow-Buy Spectrum in NLP for Political Science
The Case for Building: Domain-Specific Pretraining
The Rising Floor: Why Borrowing Is Getting Better
Data
The Global Terrorism Database
Class Imbalance
Temporal Split
Methodology
Base Model Selection
Multi-Label Classification Architecture
Class Imbalance and Loss Weighting
Training Configuration
Evaluation Metrics
...and 27 more sections

Figures (5)

Figure 1: AUC performance of ConfliBERT and Confli-mBERT across attack types, plotted against class prevalence on a logarithmic scale. The two models converge in performance as class size increases, but ConfliBERT maintains a strict advantage.
Figure 2: The AUC difference between ConfliBERT and Confli-mBERT, plotted against class prevalence (log scale). The dashed red line shows the logarithmic trend. The performance gap narrows toward zero for common attack types.
Figure 3: Overall classification accuracy across nine models evaluated on a stratified 2,000-event sample from the GTD test set. The three fine-tuned models (dark bars) form a distinct performance tier above the commercial APIs (medium bars), which in turn outperform the locally deployed open-source models used without any fine-tuning (light bars). The gap between the weakest fine-tuned model (ConflLlama, 72.9%) and the strongest commercial API (Gemini Flash, 65.9%) is about seven percentage points.
Figure 4: Model size (total parameters, log scale) plotted against Micro F1 classification performance. Fine-tuned models ($\bigstar$) cluster in the upper left: small models with high performance. Commercial APIs ($\blacklozenge$) and open-source models ($\bullet$) occupy the lower right: large models with mediocre to poor performance. ConfliBERT, at 110 million parameters, outperforms DeepSeek V3.2, a model roughly 6,200 times its size. The relationship between size and performance is negative across model families, demonstrating that task-specific fine-tuning overwhelmingly dominates model scale for classification tasks.
Figure 5: F1 scores for all nine models across all nine attack types. Green cells indicate strong performance ($\geq$0.70); yellow indicates moderate performance (0.40--0.69); red indicates poor performance ($<$0.40). The leftmost three columns (fine-tuned models) are predominantly green across most attack types. The rightmost six columns (zero-shot models) show high variance: strong performance on Bombing/Explosion but near-zero performance on Unknown and Barricade Incident.
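
The colour banding in Figure 5 and the per-class comparison reported in the abstract can be illustrated with a minimal sketch. The band edges are taken from the Figure 5 caption and the F1 values from the abstract; the function names `f1_band` and `summarize` are hypothetical, and treating the yellow band as the half-open interval [0.40, 0.70) is an assumption about how unrounded scores would be handled. This is not the paper's plotting code.

```python
# Illustrative sketch only: band edges follow the Figure 5 caption,
# and the F1 values are the two class-level pairs quoted in the abstract.

def f1_band(f1: float) -> str:
    """Map an F1 score to the colour band described for Figure 5."""
    if f1 >= 0.70:
        return "green"   # strong performance
    if f1 >= 0.40:
        return "yellow"  # moderate performance (assumed to cover [0.40, 0.70))
    return "red"         # poor performance


# attack type -> (Confli-mBERT F1, ConfliBERT F1), as reported in the abstract
reported_f1 = {
    "Bombing/Explosion": (0.95, 0.96),
    "Kidnapping": (0.92, 0.91),
}


def summarize(scores):
    """Return (attack type, band for each model, ConfliBERT minus Confli-mBERT)."""
    rows = []
    for attack_type, (fine_tuned, domain_specific) in scores.items():
        gap = round(domain_specific - fine_tuned, 2)
        rows.append(
            (attack_type, f1_band(fine_tuned), f1_band(domain_specific), gap)
        )
    return rows


for row in summarize(reported_f1):
    print(row)
```

A positive gap favours ConfliBERT and a negative gap favours Confli-mBERT. For both high-frequency classes quoted in the abstract, the two models land in the same band.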
