HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

Mario Sänger, Samuele Garda, Xing David Wang, Leon Weber-Genzel, Pia Droop, Benedikt Fuchs, Alan Akbik, Ulf Leser

TL;DR

The paper evaluates end-to-end biomedical named entity recognition and normalization in cross-corpus scenarios, challenging the reliability of published in-corpus results for real-world use. It systematically benchmarks five mature tools (BERN2, bent, PubTator, SciSpacy, HunFlair2) on three corpora with four entity types, revealing substantial performance drops in cross-corpus settings. HunFlair2 emerges as the top performer on average, but results vary across entity types and are far from ideal, underscoring issues in KB mappings, annotation guidelines, and rare entities. The study emphasizes the need for robust, generalizable BTM tools and cautions users about diminishing performance when applying tools to unseen data, advocating further research on cross-domain generalization and fair evaluation practices.

Abstract

With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in texts and their linkage to reference knowledge bases are crucial steps in BTM pipelines to enable information aggregation from different documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied in the wild, i.e., on application-dependent text collections different from those used for the tools' training, varying, e.g., in focus, genre, style, and text type. This raises the question of whether the reported performance of BTM tools can be trusted for downstream applications. Here, we report on the results of a carefully designed cross-corpus benchmark for named entity extraction, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five for an in-depth analysis on three publicly available corpora encompassing four different entity types. Comparison between tools results in a mixed picture and shows that, in a cross-corpus setting, the performance is significantly lower than the one reported in an in-corpus setting. HunFlair2 showed the best performance on average, being closely followed by PubTator. Our results indicate that users of BTM tools should expect diminishing performances when applying them in the wild compared to original publications and show that further research is necessary to make BTM tools more robust.

Paper Structure (32 sections, 2 figures, 13 tables)

This paper contains 32 sections, 2 figures, 13 tables.

Figures (2)

  • Figure 2: Ablation study results: (a) Performance comparison of the five tools under three evaluation settings: end-to-end NER and NEN, NER with strict matching, and NER with lenient matching, in which every prediction that is a sub- or superstring of a gold-standard entity mention counts as a true positive. (b) Comparison of the mention-level and document-level end-to-end NER and NEN results of the five tools.
  • Figure 3: Overview of the overlaps of the true positive predictions of BERN2, HunFlair2, PubTator, SciSpacy, and bent for the different entity types. For each setting, the sub-title reports the total number of true positives (TP) found by at least one tool as well as the number of false positives (FP). For gene and tmVar (v3), we exclude SciSpacy from the analysis as it does not support normalization of gene mentions.
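
The strict and lenient NER settings named in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's evaluation code: the `Mention` type, the function names, and the example offsets are hypothetical, and span containment over character offsets is assumed here as a stand-in for the sub-/superstring relation described in the caption.

```python
from typing import Callable, List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical mention record; field names are assumptions for illustration."""
    doc_id: str
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    entity_type: str


def strict_match(pred: Mention, gold: Mention) -> bool:
    """Strict setting: document, type, and both boundaries must agree."""
    return pred == gold


def lenient_match(pred: Mention, gold: Mention) -> bool:
    """Lenient setting: one span fully contains the other (sub- or superstring)."""
    if pred.doc_id != gold.doc_id or pred.entity_type != gold.entity_type:
        return False
    pred_inside_gold = gold.start <= pred.start and pred.end <= gold.end
    gold_inside_pred = pred.start <= gold.start and gold.end <= pred.end
    return pred_inside_gold or gold_inside_pred


def count_true_positives(
    preds: List[Mention],
    golds: List[Mention],
    match: Callable[[Mention, Mention], bool],
) -> int:
    """Count predictions that match at least one gold mention."""
    return sum(any(match(p, g) for g in golds) for p in preds)


# Made-up example: one gold disease mention and three predictions.
golds = [Mention("d1", 10, 36, "disease")]
preds = [
    Mention("d1", 10, 36, "disease"),  # exact boundaries
    Mention("d1", 25, 36, "disease"),  # shorter span inside the gold span
    Mention("d1", 50, 58, "disease"),  # unrelated span
]

strict_tp = count_true_positives(preds, golds, strict_match)
lenient_tp = count_true_positives(preds, golds, lenient_match)
print(strict_tp, lenient_tp)
```

In this toy case the second prediction is rejected under strict matching but accepted under lenient matching, which is the kind of gap between the two NER settings that panel (a) compares. Note that this sketch counts per prediction, so two predictions nested in the same gold mention would both count as true positives.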