Approaches to Analysing Historical Newspapers Using LLMs

Filip Dobranić, Tina Munda, Oliver Pejić, Vojko Gorjanc, Uroš Šmajdek, David Bordon, Jakob Lenardič, Tjaša Konovšek, Kristina Pahor de Maiti Tekavčič, Ciril Bohak, Darja Fišer

Abstract

This study presents a computational analysis of the Slovene historical newspapers Slovenec and Slovenski narod from the sPeriodika corpus. It combines topic modelling, large language model (LLM)-based aspect-level sentiment analysis, entity-graph visualisation, and qualitative discourse analysis to examine how collective identities, political orientations, and national belonging were represented in public discourse at the turn of the twentieth century. Using BERTopic, we identify major thematic patterns and show both shared concerns and clear ideological differences between the two newspapers, reflecting their conservative-Catholic and liberal-progressive orientations. We further evaluate four instruction-following LLMs for targeted sentiment classification in OCR-degraded historical Slovene and select the Slovene-adapted GaMS3-12B-Instruct model as the most suitable for large-scale application. We also document important limitations, particularly its stronger performance on neutral sentiment than on positive or negative sentiment. Applied at dataset scale, the model reveals meaningful variation in the portrayal of collective identities, with some groups appearing predominantly in neutral descriptive contexts and others more often in evaluative or conflict-related discourse. We then build named-entity (NER) graphs to explore the relationships between collective identities and places, and analyse them with a mixed-methods approach that combines quantitative network analysis with critical discourse analysis. The investigation focuses on the emergence and development of intertwined historical political and socioeconomic identities. Overall, the study demonstrates the value of combining scalable computational methods with critical interpretation to support digital humanities research on noisy historical newspaper data.

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Sentiment class composition (+/-/0) for the five most frequent collective identities in the dataset. For each identity, two stacked bars show the predicted class proportions in Slovenec and Slovenski narod.
  • Figure 2: Most neutral collective identities. Top 10 collective identities with the highest proportion of neutral sentiment in Slovenec and Slovenski narod. Only identities with at least 50 mentions in each newspaper are included. Identities are ordered by the mean neutral proportion across the two newspapers. Bars represent the proportion of mentions classified as neutral by the model.
  • Figure 3: Most non-neutral collective identities. Top 10 collective identities with the highest proportion of non-neutral sentiment (positive or negative) in Slovenec and Slovenski narod. Only identities with at least 50 mentions in each newspaper are included. Identities are ordered by the mean non-neutral proportion across the two newspapers. Bars represent the combined proportion of positive and negative mentions.
  • Figure 4: Topic (red), identity (green), sentiment (yellow), and location (purple) graph for Slovenski narod. All nodes have the same size except identity nodes, whose size reflects that identity’s relative non-neutral sentiment.
  • Figure 6: A comparison of Slovenec and Slovenski narod paragraph co-occurrences for the Advertisements and announcements theme. Topics are shown in red, identities in green, sentiment in yellow, and locations in purple. All nodes have the same size except identity nodes, whose size reflects that identity’s relative non-neutral sentiment.
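
The selection rule described in the captions of Figures 2 and 3 (at least 50 mentions in each newspaper, ordering by the mean proportion across the two newspapers, top 10) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the `(newspaper, identity, label)` input layout, and the use of the labels `+`, `-`, `0` as plain strings are all assumptions made for the example.

```python
from collections import Counter, defaultdict

NEWSPAPERS = ("Slovenec", "Slovenski narod")


def rank_by_neutral_share(mentions, min_mentions=50, top_n=10):
    """Rank identities by mean neutral share across both newspapers.

    `mentions` is an iterable of (newspaper, identity, label) tuples,
    where label is one of '+', '-', '0' (hypothetical input layout).
    """
    totals = defaultdict(Counter)   # identity -> newspaper -> all mentions
    neutral = defaultdict(Counter)  # identity -> newspaper -> neutral mentions
    for paper, identity, label in mentions:
        totals[identity][paper] += 1
        if label == "0":
            neutral[identity][paper] += 1

    rows = []
    for identity, per_paper in totals.items():
        # Keep only identities with enough mentions in *each* newspaper.
        if any(per_paper[p] < min_mentions for p in NEWSPAPERS):
            continue
        shares = {p: neutral[identity][p] / per_paper[p] for p in NEWSPAPERS}
        mean_share = sum(shares.values()) / len(NEWSPAPERS)
        rows.append((identity, shares, mean_share))

    # Highest mean neutral share first, as in the Figure 2 ordering.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]
```

Under the same assumptions, the non-neutral share used for Figure 3 is one minus the neutral share for each newspaper, so the Figure 3 ordering would come from sorting the same rows in ascending order of mean neutral share.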