A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain

Hermann Kroll, Pascal Sackhoff, Bill Matthias Thang, Maha Ksouri, Wolf-Tilo Balke

TL;DR

This work focuses on relation extraction and text classification, showcased on eight biomedical benchmarks, and considers trade-offs between accuracy and application costs.

Abstract

Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of documents, or implementing novel access paths. All of these applications require some text processing, whether to identify relevant entities, to extract semantic relationships between them, or to classify documents into categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmarks, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, LLama, and Olmo, and discuss how to design final pipelines. To this end, we focus on relation extraction and text classification, using the showcase of eight biomedical benchmarks.

Paper Structure

This paper contains 26 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Task 1 (RE). Label distribution of each benchmark.
  • Figure 2: Task 1 (RE). Hyperparameter search distribution between the best and worst models for the two best shallow models and language models comparing the accuracy score.
  • Figure 3: Task 2 (TC). Label distribution for each benchmark.
  • Figure 4: Task 2 (TC). Hyperparameter search distribution between the best and worst models for the two best shallow models and language models comparing the accuracy score.
  • Figure 5: Task 2 (TC). Achieved F1 scores are shown for the test sets when training data is reduced.
  • ...and 1 more figure