LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Aida Kostikova, Ole Pütz, Steffen Eger, Olga Sabelfeld, Benjamin Paassen

Abstract

Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
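As a rough illustration of the bias-correction idea mentioned in the abstract, the sketch below estimates the share of one class from model soft labels and a randomly sampled gold-annotated subset. It assumes the standard design-adjusted pseudo-outcome from the DSL literature, i.e. the model's predicted probability minus an inverse-probability-weighted prediction error on the gold-labeled items. The function name `dsl_share`, the single known sampling probability `pi`, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def dsl_share(soft_labels, gold_labels, sampled, pi):
    """Design-adjusted estimate of a class share (illustrative sketch).

    soft_labels: model probability that each document belongs to the class
    gold_labels: 1/0 expert label where available, None otherwise
    sampled:     True if the document was randomly drawn for expert annotation
    pi:          known probability of being drawn for annotation (assumed constant)
    """
    total = 0.0
    for p, y, r in zip(soft_labels, gold_labels, sampled):
        # The correction term is nonzero only for gold-annotated documents.
        correction = (p - y) / pi if r else 0.0
        total += p - correction
    return total / len(soft_labels)


# Toy data (hypothetical): four documents, two of them gold-annotated.
soft = [0.9, 0.8, 0.2, 0.1]
gold = [1, None, 0, None]
drawn = [True, False, True, False]
estimate = dsl_share(soft, gold, drawn, pi=0.5)
```

When every document is gold-annotated (`pi=1`), the estimate reduces to the plain mean of the gold labels, which is a quick sanity check on the correction term.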

Paper Structure

This paper contains 19 sections, 4 equations, 17 figures, 20 tables.

Figures (17)

  • Figure E.18: Macro F1 scores over time for each model (best configuration), and average pairwise Cohen's Kappa for human annotators. Evaluated on the full test set, including Frau and both test sets for Migrant, grouped by decade.
  • Figure E.19: High-level row-normalized confusion matrices (%) for the best-performing configuration of each model on the two migrant test sets (Test 1 and Test 2, combined). Each row sums to 100, showing how items with a given reference label are distributed across predicted labels. Human Annotators (LOO) shows aggregated leave-one-out annotator comparisons, where each annotator is compared to the consensus of the remaining annotators. The Ensemble model aggregates predictions from Llama-3.3-70B, Qwen-2.5-72B, and gpt-oss-120B.
  • Figure E.20: Fine-grained row-normalized confusion matrices (%) showing the best-performing configuration for each of the selected models on the two migrant test sets (Test 1 and Test 2), combined. Each row sums to 100, showing how items with a given reference label are distributed across predicted labels. Human Annotators (LOO) shows aggregated leave-one-out annotator comparisons, where each annotator is compared to the consensus of the remaining annotators. The Ensemble model aggregates predictions from Llama-3.3-70B, Qwen-2.5-72B, and gpt-oss-120B.
  • Figure E.21: Distribution of all Migrant keywords over the years, normalized per keyword. The keywords are sorted by frequency, which means that the reliability decreases towards the bottom-right.
  • Figure E.22: (Anti-)solidarity trends for the Migrant category by keyword and decade, 1867-2025. Each panel shows DSL-adjusted decade-level shares of solidarity and anti-solidarity for one keyword (see definition (c) in the trend-definitions box). The mixed and none categories are included in the four-class distribution but omitted from the visualization. Keywords are sorted by frequency, so estimates become less reliable toward the bottom right.
  • ...and 12 more figures
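
The row-normalized confusion matrices described in the captions for Figures E.19 and E.20 can be sketched as follows. This is a minimal illustration, assuming the four high-level labels named in the Figure E.22 caption (solidarity, anti-solidarity, mixed, none). The function name and the toy annotations are hypothetical and do not come from the paper.

```python
from collections import Counter

LABELS = ["solidarity", "anti-solidarity", "mixed", "none"]


def row_normalized_confusion(reference, predicted, labels=LABELS):
    """Return {reference_label: {predicted_label: percent}}.

    Each non-empty row sums to 100, so a row shows how items with a given
    reference label are distributed across predicted labels.
    """
    counts = Counter(zip(reference, predicted))
    matrix = {}
    for ref in labels:
        row_total = sum(counts[(ref, pred)] for pred in labels)
        matrix[ref] = {
            # Rows with no reference items are left at 0 to avoid division by zero.
            pred: (100.0 * counts[(ref, pred)] / row_total) if row_total else 0.0
            for pred in labels
        }
    return matrix


# Toy annotations (hypothetical).
reference = ["solidarity", "solidarity", "anti-solidarity", "none"]
predicted = ["solidarity", "mixed", "anti-solidarity", "none"]
matrix = row_normalized_confusion(reference, predicted)
```

Row normalization makes rows comparable even when reference labels are imbalanced, at the cost of hiding how many items each row is based on.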