Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs

Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa, Filip Ginter

TL;DR

This paper investigates zero-shot information extraction of social organizations and hobbies from a corpus of 89,339 Finnish Karelian refugee interviews, comparing GPT-4, open LLMs, and a FinBERT-based supervised approach. Through careful prompt engineering, batching analyses, and language choices, the authors show that GPT-4 achieves an F-score of 88.8%, while a strong open model (Llama-3-70B-Instruct) nearly matches it at 87.7%, and FinBERT reaches 84.1–86.3% when fine-tuned on GPT-4-generated data. The study also provides a full-data extraction, revealing hundreds of thousands of hobby and organization mentions, and analyzes energy costs and scalability. Collectively, the results highlight the viability of open and hybrid approaches for large-scale information extraction in non-English historical corpora, with practical implications for digital humanities and migration research.

Abstract

We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain a broader insight into the relative merits of these different approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at 87.7% F-score, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, where we fine-tune a Finnish BERT model (FinBERT) using GPT-4 generated training data. With this method, we achieve an F-score of 84.1% with as few as 6k training interviews, rising to 86.3% with 30k interviews. Such an approach would be particularly appealing in cases where computational resources are limited or there is a substantial mass of data to process.

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Two interviews translated into English for illustration purposes. The relevant information (hobbies, social organizations) related to the primary person (the one being interviewed) is illustrated in yellow, while blue indicates the spouse.
  • Figure 2: The final prompt as used in our experiments paired with a sample interview.
  • Figure 3: Example response obtained from the GPT-4 API (translated from Finnish).
  • Figure 4: Process of creating NER-like data for model fine-tuning. English translation of the example is: Mrs. serves as the secretary of the Loppi Kuparsaari Martha's Association. Hash symbols (##) indicate the subword tokenization as produced by the FinBERT language model.
  • Figure 5: Performance of a fine-tuned BERT model with increasing training dataset size, measured as the number of training interviews. The model peaks at 30k training examples, corresponding to using 33% of the full data for training (30k/90k). However, quite competitive results are already obtained with 3k training examples (3% of the full data). The dashed line represents the F-score of GPT-4.
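
The Figure 4 caption describes turning extracted mentions into NER-like training data over FinBERT subword tokens. The following is a minimal Python sketch of that kind of conversion, not the authors' actual pipeline. It assumes mentions are located by exact matching of whitespace-separated words and that continuation pieces of a word receive an inside tag. The label name `ORG` and the helper `toy_wordpiece` (a fixed-size chunker standing in for the real FinBERT WordPiece tokenizer) are hypothetical and used only for illustration.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str, size: int = 5) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer.

    Splits a word into fixed-size chunks and marks continuation
    pieces with '##', mimicking the notation in Figure 4.
    """
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return [c if i == 0 else "##" + c for i, c in enumerate(chunks)]


def bio_tag_words(words: List[str], mentions: List[Tuple[str, str]]) -> List[str]:
    """Assign word-level BIO tags by exact matching of each mention.

    `mentions` holds (mention text, label) pairs, e.g. as extracted by
    a generative model. Only the first untagged occurrence is tagged.
    """
    tags = ["O"] * len(words)
    for text, label in mentions:
        target = text.split()
        n = len(target)
        if n == 0:
            continue
        for start in range(len(words) - n + 1):
            span_free = all(t == "O" for t in tags[start:start + n])
            if words[start:start + n] == target and span_free:
                tags[start] = "B-" + label
                for k in range(start + 1, start + n):
                    tags[k] = "I-" + label
                break
    return tags


def expand_to_subwords(
    words: List[str],
    tags: List[str],
    tokenize: Callable[[str], List[str]] = toy_wordpiece,
) -> Tuple[List[str], List[str]]:
    """Propagate word-level tags to subword pieces.

    The first piece keeps the word's tag; later pieces of a tagged
    word get the matching I- tag (an assumed convention).
    """
    pieces: List[str] = []
    piece_tags: List[str] = []
    for word, tag in zip(words, tags):
        for i, piece in enumerate(tokenize(word)):
            pieces.append(piece)
            if i == 0 or tag == "O":
                piece_tags.append(tag)
            else:
                piece_tags.append("I-" + tag[2:])
    return pieces, piece_tags


if __name__ == "__main__":
    # English translation of the Figure 4 example sentence.
    sentence = "Mrs. serves as the secretary of the Loppi Kuparsaari Martha's Association"
    words = sentence.split()
    tags = bio_tag_words(words, [("Loppi Kuparsaari Martha's Association", "ORG")])
    for piece, tag in zip(*expand_to_subwords(words, tags)):
        print(f"{piece}\t{tag}")
```

In practice the tokenizer argument would be replaced by the tokenizer of the model being fine-tuned, and exact matching would likely need to be relaxed for inflected Finnish word forms; both choices are left open here.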