Actor Identification in Discourse: A Challenge for LLMs?

Ana Barić; Sean Papay; Sebastian Padó

TL;DR

The paper tackles actor identification in discourse networks by comparing a traditional CRF/XLM-RoBERTa pipeline with an end-to-end Llama 2 LLM approach on a German newspaper dataset. It shows that the traditional pipeline generally achieves higher exact-match $F_1$ due to stronger control over canonical form, while the LLM better locates the correct actor but struggles with canonicalization. A hybrid approach that post-processes LLM outputs within the pipeline yields the best overall performance, especially under relaxed canonicalization criteria. This work highlights controllability challenges in LLM generation for canonical naming and suggests retrieval-augmented or hybrid strategies as practical paths forward for discourse-network construction.

Abstract

The identification of political actors who put forward claims in public debate is a crucial step in the construction of discourse networks, which are helpful for analyzing societal debates. Actor identification is, however, rather challenging: often, the locally mentioned speaker of a claim is only a pronoun ("He proposed that [claim]"), so recovering the canonical actor name requires discourse understanding. We compare a traditional pipeline of dedicated NLP components (similar to those applied to the related task of coreference) with an LLM, which appears to be a good match for this generation task. Evaluating on a corpus of German actors in newspaper reports, we find, surprisingly, that the LLM performs worse. Further analysis reveals that the LLM is very good at identifying the right reference, but struggles to generate the correct canonical form. This points to an underlying issue in LLMs with controlling generated output. Indeed, a hybrid model combining the LLM with a classifier to normalize its output substantially outperforms both initial models.
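
The normalization idea behind the hybrid model can be illustrated with a minimal sketch. Note the assumptions: the paper normalizes LLM output with a classifier, whereas this sketch substitutes simple fuzzy string matching from Python's standard library as a stand-in; the function name `normalize_actor`, the `cutoff` value, and the example names are hypothetical and not taken from the paper.

```python
from difflib import get_close_matches


def normalize_actor(generated, canonical_names, cutoff=0.6):
    """Map a free-form generated actor string to a canonical name.

    Stand-in for the paper's classifier: fuzzy string matching against
    a closed list of canonical names (an assumption of this sketch).
    Returns None when no candidate is similar enough.
    """
    # Case-insensitive lookup table from folded form to canonical form.
    lookup = {name.casefold(): name for name in canonical_names}
    key = generated.strip().casefold()

    # An exact (case-insensitive) hit needs no fuzzy matching.
    if key in lookup:
        return lookup[key]

    # Otherwise take the single closest candidate above the cutoff.
    matches = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


# Hypothetical canonical inventory for illustration only.
CANONICAL = ["Angela Merkel", "Horst Seehofer"]

print(normalize_actor("  angela merkel ", CANONICAL))
print(normalize_actor("Angela Merkl", CANONICAL))
print(normalize_actor("xyz", CANONICAL))
```

The point of the design is that the generator is free to produce any surface form, while the final output is constrained to a closed set of canonical names, which is the kind of control the abstract reports the LLM lacking on its own.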

Paper Structure (13 sections, 1 figure, 5 tables)

This paper contains 13 sections, 1 figure, 5 tables.

Introduction
Methods
Actor Identification: Task Definition
A Traditional Pipeline Architecture
An LLM-Based Architecture
Experimental Setup
Data
Evaluation
Results and Analysis
Main results.
Hybrid model.
Conclusion
Prompt Templates

Figures (1)

Figure 1: Discourse network with actors as circles and claims as squares (adapted from Padó et al., 2019).
