Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai; Singo Sakashita; Shumpei Ishikawa; Shogo Watanabe; Anna Matsuoka; Mikio Sakurai; Yasuto Fujimoto; Yoshiyuki Takahara; Atsushi Ohara; Hirohiko Miyake; Genichiro Ishii

Abstract

The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.

Paper Structure (13 sections, 2 figures, 6 tables)

Introduction
Materials and Methods
Models and execution environment
Benchmark A: formatted report generation and information extraction
Benchmark B: typo correction
Benchmark C: subjective evaluation of explanatory text
Hyperparameters, code availability, and ethics
Results
Structured reporting and extraction
Typo correction
Subjective evaluation of explanatory text
Discussion
Conclusion

Figures (2)

Figure 1: Distribution of explanatory-text ratings by model among five pathologists (a) and three clinicians (b).
Figure 2: Model-level inter-rater reliability for explanatory-text evaluation. Panels (a) and (b) show ICC(2,1) and ICC(2,k) for pathologists; panels (c) and (d) show the same for clinicians. Bars indicate 95% confidence intervals.
