Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data

Jonas Golde, Patrick Haller, Max Ploner, Fabio Barth, Nicolaas Jedema, Alan Akbik

TL;DR

This paper addresses the overestimation of zero-shot NER performance caused by label overlap between large synthetic training datasets and evaluation benchmarks. It introduces Familiarity, a metric that combines semantic similarity between entity types with training-label support to quantify label shift and transfer difficulty. Through extensive experiments with GLiNER across multiple synthetic datasets and seven benchmarks, the authors show that label overlap inflates transfer performance and that Familiarity correlates with, but does not fully determine, zero-shot F1. The work provides practical tools for fair model comparisons and for constructing evaluation splits with varying transfer difficulty, along with open-source code and benchmark resources. Together, these contributions support more reliable assessment and data-efficient evaluation in zero-shot NER.

Abstract

Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as 'Person' or 'Medicine') without any training examples. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even the same as) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. Further, we argue that current evaluation setups provide an incomplete picture of zero-shot abilities since they do not quantify the label shift (i.e., the similarity of labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation, as well as their frequency in the training data, to provide an estimate of label shift. It allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. Further, it enables researchers to generate evaluation setups of various transfer difficulties for fine-grained analysis of zero-shot NER.
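The abstract describes Familiarity as combining two signals: the semantic similarity between training and evaluation entity types, and how often the training types occur. The following is a minimal sketch of one way such a score could be computed, not the paper's exact definition. The function name `familiarity_sketch`, the toy two-dimensional embeddings, the clipping of negative similarities to zero, and the 1/rank weighting are all assumptions made for illustration; in practice the label vectors would come from a sentence-embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def familiarity_sketch(
    eval_labels: List[str],
    train_counts: Dict[str, int],
    embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> float:
    """Illustrative familiarity-style score in [0, 1] (hypothetical formulation).

    For each evaluation label, training labels are ranked by similarity,
    each repeated once per training mention. The top-k similarities are
    averaged with rank weights 1, 1/2, 1/3, ... (an assumed weighting).
    """
    if not eval_labels:
        return 0.0
    weights = [1.0 / (rank + 1) for rank in range(k)]
    per_label = []
    for label in eval_labels:
        # (similarity, mention count) pairs, most similar training label first.
        ranked = sorted(
            (
                (max(cosine(embeddings[label], embeddings[t]), 0.0), count)
                for t, count in train_counts.items()
            ),
            reverse=True,
        )
        # Fill the k slots, letting frequent labels occupy several slots.
        top: List[float] = []
        for sim, count in ranked:
            take = min(count, k - len(top))
            top.extend([sim] * take)
            if len(top) == k:
                break
        top.extend([0.0] * (k - len(top)))  # pad if training data is tiny
        per_label.append(sum(w * s for w, s in zip(weights, top)) / sum(weights))
    return sum(per_label) / len(per_label)


# Toy embeddings (hypothetical): identical labels align, unrelated ones are orthogonal.
toy_embeddings = {"person": [1.0, 0.0], "location": [0.0, 1.0]}
seen = familiarity_sketch(["person"], {"person": 5}, toy_embeddings)
unseen = familiarity_sketch(["person"], {"location": 5}, toy_embeddings)
rare = familiarity_sketch(["person"], {"person": 1, "location": 10}, toy_embeddings)
```

Under these assumptions, a score near 1 means the evaluation labels are well covered by frequent, similar training labels, and a score near 0 indicates a large label shift. The `rare` case shows why support matters: a matching label that occurs only once fills a single slot, so the score stays well below the `seen` case.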

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Impact of training data on zero-shot performance of the current state-of-the-art approach (GLiNER). Each synthetic dataset is characterized by the label overlap (yellow column) and the total number of entity mentions (purple column). While zero-shot performance (red line, macro-averaged F1 across 7 benchmarks) has significantly improved, we note a concerning increase in entity type overlaps between training and testing data.
  • Figure 2: With LLMs now capable of generating datasets that cover thousands of entity types, models trained on different datasets are subject to varying label shifts, making comparisons between them challenging. To address this, we introduce Familiarity, a metric that quantifies and accounts for label shift, enabling more accurate and fair comparisons across models.
  • Figure 3: Transfer performance is higher on entity types that occur in both the evaluation and fine-tuning datasets than on unseen types. Further, we observe a positive, log-linear correlation between the number of mentions of an entity type and its final performance.
  • Figure 4: Familiarity for different values of $k$ and using different rank weights.
  • Figure 5: Overlapping entity types between considered synthetic training datasets and all evaluation benchmarks.
  • ...and 1 more figure
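Several of the captions above refer to label overlap between a synthetic training dataset and the evaluation benchmarks. As a rough illustration, the overlap could be measured as the share of benchmark entity types that also appear among the training labels. The sketch below assumes an exact string match after light normalization (lowercasing, treating hyphens and underscores as spaces); the helper names `normalize` and `label_overlap` are hypothetical, and the paper may define overlap differently.

```python
from typing import Iterable, Set, Tuple


def normalize(label: str) -> str:
    """Lowercase a label and treat hyphens/underscores as spaces (assumed rule)."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def label_overlap(
    train_labels: Iterable[str], eval_labels: Iterable[str]
) -> Tuple[Set[str], float]:
    """Return the shared normalized labels and the fraction of evaluation
    labels that also occur in the training label set."""
    train = {normalize(label) for label in train_labels}
    evals = {normalize(label) for label in eval_labels}
    shared = train & evals
    ratio = len(shared) / len(evals) if evals else 0.0
    return shared, ratio


# Hypothetical label sets, for illustration only.
shared, ratio = label_overlap(
    ["Person", "geo-political entity", "Medicine"],
    ["person", "location", "Geo_Political Entity", "drug"],
)
```

Exact matching misses near-synonyms such as "Medicine" and "drug", which is the gap a similarity-based measure like Familiarity is meant to close.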