Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge

Xiao Liu, Zirui Wu, Jiayi Li, Zhicheng Shao, Xun Pang, Yansong Feng

TL;DR

This paper tackles the challenge of producing timely, high-quality annotations for evolving longitudinal network data, using climate negotiations as a testbed. It introduces the Expert-Augmented LLM Annotation (EALA) framework, which fuses large language models with an expert-crafted codebook and human-annotated history to extend datasets across time. Through extensive experiments on the climate negotiation corpus, EALA demonstrates strong performance, outpacing traditional supervised models and nearing human annotator levels, with instruction-tuning identified as the dominant contributor and task decomposition mitigating hallucinations. The work highlights both the promise and limitations of LLM-driven longitudinal annotation, noting the need for post-processing, careful handling of dynamic label spaces, and potential extensions using retrieval-augmented and chain-of-thought strategies for real-world deployment.

Abstract

Longitudinal network data are essential for analyzing political, economic, and social systems and processes. In political science, these datasets are often generated through human annotation or supervised machine learning applied to evolving corpora. However, as semantic contexts shift over time, inferring dynamic interaction types on emerging issues among a diverse set of entities poses significant challenges, particularly in maintaining timely and consistent annotations. This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. We evaluate the performance and reliability of EALA using a dataset of climate negotiations. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. At the same time, we identify several limitations inherent to LLM-based annotation, highlighting areas for further improvement. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.

Paper Structure

This paper contains 33 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of the EALA framework.
  • Figure 2: Label definitions and coding rules in the codebook (illustrated with the climate negotiation example).
  • Figure 3: Frequency of the term net zero from Google Ngram Viewer.
  • Figure 4: Yearly distribution of reports and interactions.
  • Figure 5: Degree of activity of negotiation entities.
  • ...and 8 more figures