BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter

TL;DR

This work tackles coreference resolution in biomedical texts by benchmarking generative LLM prompting strategies against a SpanBERT baseline on the CRAFT corpus. It defines four prompting configurations (local-only, local with reference context, abbreviation-aware, and entity-aware) and evaluates their ability to resolve pronouns, definite/indefinite noun phrases, and abbreviations. Results show that domain-grounded prompts can markedly improve recall and F1, with smaller LLaMA models (8B and 17B) frequently outperforming the 70B variant, which suggests that prompt design matters more than sheer model size. The study also demonstrates that structured inputs such as abbreviation and entity dictionaries provide meaningful gains, while long-range context remains challenging, pointing to avenues for hybrid systems and external knowledge integration.

Abstract

Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local context, contextual enrichment, and domain-specific cues such as abbreviation and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
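
To make the four prompting configurations concrete, the following is a minimal sketch of how a prompt could be assembled per chunk. It is an illustration under stated assumptions, not the authors' implementation: the `Chunk` type, the `build_prompt` function, the mode names, and all prompt wording are hypothetical. The summary does not say whether the abbreviation-aware and entity-aware setups also include prior context, so the sketch treats each configuration as a separate add-on to the local passage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """One text chunk plus optional auxiliary inputs (hypothetical structure)."""
    text: str
    prior_context: str = ""
    abbreviations: Dict[str, str] = field(default_factory=dict)
    entities: List[str] = field(default_factory=list)


# Placeholder instruction; the wording used in the paper is not given here.
TASK = "List the coreference clusters (pronouns, noun phrases, abbreviations) in the passage."


def build_prompt(chunk: Chunk, mode: str) -> str:
    """Assemble a prompt for one of four assumed modes:
    'local', 'ref_ctx', 'abbr', or 'entity'."""
    parts = [TASK]
    if mode == "local":
        pass  # the chunk is presented on its own
    elif mode == "ref_ctx":
        parts.append("Previous context:\n" + chunk.prior_context)
    elif mode == "abbr":
        lines = [f"{short} = {long}" for short, long in sorted(chunk.abbreviations.items())]
        parts.append("Abbreviations:\n" + "\n".join(lines))
    elif mode == "entity":
        parts.append("Known entities:\n" + "\n".join(chunk.entities))
    else:
        raise ValueError(f"unknown mode: {mode}")
    parts.append("Passage:\n" + chunk.text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    example = Chunk(
        text="It was expressed in the retina.",
        prior_context="Sox2 is a transcription factor.",
        abbreviations={"ES": "embryonic stem"},
        entities=["Sox2"],
    )
    print(build_prompt(example, "entity"))
```

Keeping the passage last in every mode means the only difference between configurations is the auxiliary block, which makes it easier to attribute a change in scores to that block.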

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the coreference resolution pipeline under four prompting strategies. Each chunk is processed by an LLM independently (Exp. 1), with prior context (Exp. 2), or with auxiliary inputs such as abbreviation dictionaries (Exp. 3) or entity dictionaries (Exp. 4).
  • Figure 2: Distribution of word distances between coreferent mentions in biomedical texts, grouped into four ranges.
  • Figure 3: Heatmap of precision, recall, and F1 scores for LLaMA models (70B, 17B, 8B) across four experimental setups (LOCAL, REF CTX, ABBR, ENTITY) and coreference categories (pronouns, indefinite NPs, abbreviations, definite NPs).
  • Figure 4: Extracted coreference type counts by model and context.
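
Figure 3 reports precision, recall, and F1 per model, setup, and coreference category. As a rough illustration of how such scores relate to one another, the sketch below computes them over sets of (mention, antecedent) links. This is an assumption made for illustration: the summary does not state which coreference metric the paper uses, and the `prf1` function and `Link` type are hypothetical names.

```python
from typing import Set, Tuple

# A link pairs a mention with the antecedent it is resolved to (assumed representation).
Link = Tuple[str, str]


def prf1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted links against gold links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = {("it", "Sox2"), ("the gene", "Sox2")}
    predicted = {("it", "Sox2"), ("ES", "embryonic stem")}
    print(prf1(predicted, gold))
```

Scoring each category (pronouns, indefinite NPs, abbreviations, definite NPs) separately would amount to calling the function once per category on the links of that type.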