Technical Report: Small Language Model for Japanese Clinical and Medicine

Shogo Watanabe

TL;DR

This work presents NCVC-slm-1, a 1.2B-parameter Japanese clinical language model trained on high-quality, domain-augmented data to enable local, privacy-preserving clinical NLP. The authors combine a curated Common Corpus and Medicine Textbooks (including synthesized content) and implement a morphology-first tokenizer with Unigram subwords, Rotary PE, and Grouped Query Attention to build a compact yet capable model. Fine-tuning with instruction data yields strong performance on several JMED-LLM tasks, though IgakuQA results and certain tasks remain challenging for a 1B-scale model, underscoring data quality and scale as key factors. The study demonstrates the feasibility of effective domain-specific SLMs for clinical use in settings with constrained compute, while outlining data and methodological paths to improve generalization and real-world applicability.
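As an illustration of the rotary position embedding named above, the following is a minimal pure-Python sketch, not the report's implementation. It assumes the common interleaved-pair formulation with a base of 10000; the function names `rope` and `dot` and the toy vectors are hypothetical. It shows the property that makes rotary embeddings useful in attention: the query-key dot product depends only on the relative offset between positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by an angle that
    grows with the position `pos` and shrinks with the pair index.
    Assumption: interleaved pairing and base=10000 (a common default)."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for pair index i/2
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]

# Same relative offset (2) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

Here `score_a` and `score_b` agree up to floating-point error, because rotating both vectors by position-dependent angles leaves only the angle difference in the dot product.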

Abstract

This report presents NCVC-slm-1, a small language model (SLM) for Japanese clinical and medical text. The 1B-parameter model was trained on Japanese text classified as high quality, and its training data was augmented with clinical and medical content covering a variety of diseases, drugs, and examinations. With carefully designed pre-processing and a specialized morphological analyzer and tokenizer, this small, lightweight model not only generates text but also shows the feasibility of understanding clinical and medical text. Compared with other large language models, the fine-tuned NCVC-slm-1 achieved the highest scores on 6 of the 8 JMED-LLM tasks. This result indicates that an SLM can feasibly perform several downstream tasks in the clinical and medical field. The hope is that NCVC-slm-1 will contribute to developing and accelerating this field toward a bright future.
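To make the idea of a morphological analyzer followed by a Unigram subword tokenizer concrete, here is a minimal sketch under stated assumptions. It is not the report's tokenizer: the morpheme list stands in for the output of a hypothetical analyzer, the vocabulary and its log-probabilities are invented toy values, and `unigram_segment` and `tokenize` are illustrative names. The sketch applies Viterbi search to find the highest-scoring Unigram segmentation inside each morpheme, so subword pieces never cross a morpheme boundary.

```python
import math

def unigram_segment(word, log_probs, unk_log_prob=-20.0):
    """Best segmentation of `word` under a unigram model (Viterbi search).
    Unknown single characters fall back to `unk_log_prob` (an assumption)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: top score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
            elif end - start == 1:
                score = best[start] + unk_log_prob
            else:
                continue
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

def tokenize(morphemes, log_probs):
    """Segment each morpheme independently, then concatenate the pieces."""
    tokens = []
    for morpheme in morphemes:
        tokens.extend(unigram_segment(morpheme, log_probs))
    return tokens

# Toy vocabulary with invented log-probabilities (hypothetical values).
vocab = {"心": -4.0, "不全": -3.0, "心不全": -2.5, "の": -1.0,
         "治療": -2.0, "治": -5.0, "療": -5.0}

# Assumed analyzer output for "心不全の治療" (treatment of heart failure).
morphemes = ["心不全", "の", "治療"]
tokens = tokenize(morphemes, vocab)
```

With this toy vocabulary each morpheme is kept whole, since the single-piece score beats any split. A morpheme missing from the vocabulary, such as "心筋", falls back to smaller pieces instead.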

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Data proportions in pre-training corpus
  • Figure 2: The overview of NCVC-slm-1 model architecture
  • Figure 3: The loss logging during self-supervised pre-training
  • Figure 4: Token embedding space by t-SNE (perplexity=15)
  • Figure 5: Token embedding space by UMAP (neighbor=15)
  • ...and 5 more figures