Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

TL;DR

The paper presents the fourth Shared Task on Multilingual Coreference Resolution, which introduces an LLM track and expands the CorefUD-based data to 17 languages. It compares nine systems across two tracks, the LLM track and the traditional Unconstrained track, using a plaintext encoding for LLMs and the CorefUD scorer with CoNLL F1 as the primary metric. Key contributions include the integration of CorefUD 1.3 resources, data reductions, plaintext evaluation formats, and a diverse mix of end-to-end and pipeline-based systems, including the CorPipe ensembles. Traditional, well-tuned systems (e.g., the CorPipe ensembles) still outperform LLM-based approaches on most datasets. LLMs nevertheless show potential, surpassing non-LLM systems on some datasets, which points to future improvements in prompting, data annotation, and the representation of coreference in language models.
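As a rough illustration of the primary metric: the CoNLL F1 score is conventionally the unweighted average of the F1 scores of three coreference metrics (MUC, B-cubed, and CEAF-e). The sketch below only shows that averaging step. The function names and the precision/recall numbers are hypothetical, and the official CorefUD scorer's mention matching and other details are not reproduced here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def conll_f1(muc, b_cubed, ceaf_e):
    """Average the F1 of three metrics.

    Each argument is a (precision, recall) pair for one metric.
    This is an illustrative helper, not part of any official scorer.
    """
    scores = [f1(*muc), f1(*b_cubed), f1(*ceaf_e)]
    return sum(scores) / len(scores)


# Hypothetical precision/recall pairs, for illustration only.
example_score = conll_f1((0.80, 0.70), (0.75, 0.65), (0.70, 0.60))
print(f"CoNLL F1 (illustrative): {example_score:.4f}")
```

Because the three metrics are averaged with equal weight, a system that is weak on one of them (for example, entity-level alignment under CEAF-e) cannot fully compensate with strong link-based scores alone.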

Abstract

The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference. A key innovation of this year's task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation. The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD, a harmonized multilingual collection of 22 datasets in 17 languages. In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems still kept the lead, LLMs showed clear potential, suggesting they may soon challenge established approaches in future editions.

Paper Structure

This paper contains 53 sections, 3 figures, and 15 tables.

Figures (3)

  • Figure 1: Our plaintext serialization of a Spanish example sentence from es_ancora. For clarity, mention spans are highlighted by colored underlining, where two coreferential entities share the same color. A zero mention labeled on an empty node is greyed. Note that multi-word tokens are split in the plaintext format into syntactic words (e.g., the Spanish "del" appears as "de el"); this conversion error was identified after the data release.
  • Figure 2: Plot with results for individual languages in the primary metric (CoNLL F$_1$). This plot shows the same information as Table \ref{tab:all-langs}, but languages are sorted according to the performance of the best system and LLM-based systems are shown with dashed lines.
  • Figure 3: Evolution of CodaLab submissions in the evaluation phase. Submissions to the LLM track are shown with dashed lines and submissions to the Unconstrained track with solid lines. For clarity, all submissions of the ÚFAL CorPipe team are represented by the scores of CorPipeEnsemble.