Towards Generating Automatic Anaphora Annotations

Dima Taji, Daniel Zeman

TL;DR

The paper addresses data scarcity in anaphora and coreference resolution by exploring strategies for generating annotated data automatically. It pursues two parallel tracks: converting existing resources into the CorefUD format (with Arabic OntoNotes) and using multilingual models (CorPipe) to annotate the Universal Dependencies corpora. The resulting datasets are intended as scalable resources for multilingual coreference; the initial outputs are a CorefUD conversion of Arabic OntoNotes and UD corpora annotated with CorPipe. The work is ongoing, with plans to publish final conversions, evaluations, and error analyses after manual validation.

Abstract

Training models that perform well on various NLP tasks requires large amounts of data, and this need becomes more apparent with nuanced tasks such as anaphora and coreference resolution. To combat the prohibitive cost of creating manually annotated gold data, this paper explores two methods for automatically creating datasets with coreferential annotations: direct conversion from existing datasets, and parsing using multilingual models capable of handling new and unseen languages. The paper details the current progress on both fronts, the challenges these efforts currently face, and our approach to overcoming them.

Paper Structure

This paper contains 12 sections, 2 tables.