A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models

Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru

TL;DR

This work tackles end-to-end biomedical relation extraction (RE) without task-specific training data by benchmarking frontier LLMs (GPT-4, o1, GPT-OSS-120B) on seven diverse RE datasets using JSON-output prompts. It demonstrates that zero-shot RE (ZSRE) can approach supervised performance on shorter, simpler instances, with GPT-OSS-120B offering the strongest overall results and GPT-4 lagging on several datasets, especially those with many relation types. The study highlights systematic error patterns, such as under-prediction on instances dense with relations and entity boundary-mismatch errors, and it shows that long, dense inputs (e.g., BioRED) remain challenging. By releasing prompts, datasets, and code, the paper provides a practical benchmark and points to future directions, such as improved entity boundary extraction and dataset re-annotation, for realizing robust, scalable biomedical knowledge base population via ZSRE.

Abstract

Extracting relations from scientific literature is a fundamental task in biomedical NLP because entities and relations among them drive hypothesis generation and knowledge discovery. As literature grows rapidly, relation extraction (RE) is indispensable for curating knowledge graphs to be used as computable structured and symbolic representations. With the rise of LLMs, it is pertinent to examine whether it is better to skip tailoring supervised RE methods, save the annotation burden, and just use zero-shot RE (ZSRE) via LLM API calls. In this paper, we propose a benchmark with seven biomedical RE datasets with interesting characteristics and evaluate three OpenAI models (GPT-4, o1, and GPT-OSS-120B) for end-to-end ZSRE. We show that LLM-based ZSRE is inching closer to supervised methods in performance on some datasets but still struggles on complex inputs expressing multiple relations with different predicates. Our error analysis reveals scope for improvement.

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: (Left) The average number of GPT-4 predicted relations per test instance is plotted against the number of gold relations in the instance for the CDR dataset. The line $y = x$ is overlaid for ease of interpretation. (Right) Recall is calculated for subsets of the data grouped by the number of gold relations.
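
The two quantities described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `stats_by_gold_count`, the representation of relations as `(subject, predicate, object)` tuples, exact-match scoring, and pooling recall within each group are all assumptions, and the data are toy values rather than CDR results.

```python
from collections import defaultdict


def stats_by_gold_count(instances):
    """Group test instances by their number of gold relations.

    `instances` is an iterable of (gold, pred) pairs, where each element is a
    set of (subject, predicate, object) tuples. Exact tuple match is assumed.
    Returns {gold_count: {"avg_predicted": float, "recall": float}}.
    """
    buckets = defaultdict(lambda: {"n": 0, "pred": 0, "gold": 0, "hit": 0})
    for gold, pred in instances:
        b = buckets[len(gold)]
        b["n"] += 1                      # instances in this group
        b["pred"] += len(pred)           # predicted relations in this group
        b["gold"] += len(gold)           # gold relations in this group
        b["hit"] += len(gold & pred)     # correctly predicted relations

    out = {}
    for count, b in sorted(buckets.items()):
        out[count] = {
            # Left panel: average number of predictions per instance.
            "avg_predicted": b["pred"] / b["n"],
            # Right panel: recall pooled over the group (undefined if no gold).
            "recall": b["hit"] / b["gold"] if b["gold"] else None,
        }
    return out


# Toy data (hypothetical, for illustration only).
toy_instances = [
    ({("aspirin", "induces", "ulcer")}, {("aspirin", "induces", "ulcer")}),
    ({("a", "induces", "b")}, set()),
    (
        {("x", "induces", "y"), ("x", "induces", "z"), ("w", "induces", "y")},
        {("x", "induces", "y")},
    ),
]

for gold_count, stats in stats_by_gold_count(toy_instances).items():
    print(gold_count, stats)
```

In this sketch, a group whose `avg_predicted` falls below its gold count would sit under the $y = x$ line, which is the under-prediction pattern the caption's left panel is meant to expose.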