JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

Junfeng Jiang; Jiahao Huang; Akiko Aizawa

TL;DR

Experimental results indicate that LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, and there is still much room for improving the existing LLMs in certain Japanese biomedical tasks.

Abstract

Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark, as well as the datasets, are publicly available at https://huggingface.co/datasets/Coldog2333/JMedBench to facilitate future research.

Paper Structure (29 sections, 6 figures, 14 tables)

Introduction
Related Works
JMedBench
Datasets
Evaluation Dataset Augmentation
Multi-choice Question-Answering
Named Entity Recognition
Evaluation Protocols
Experiments
Comparison Methods
Experimental Results
Multi-choice Question-Answering
Named Entity Recognition
Machine Translation
Document Classification
...and 14 more sections

Figures (6)

Figure 1: Overview of JMedBench
Figure 2: Zero-shot and few-shot performances on different tasks in JMedBench.
Figure 3: Zero-shot performance under different prompt templates.
Figure 4: Few-shot performance under different prompt templates.
Figure 5: Zero-shot and few-shot performance over time of all involved LLMs.
...and 1 more figure
