Overview of the PromptCBLUE Shared Task in CHIP2023

Wei Zhu; Xiaoling Wang; Mosha Chen; Buzhou Tang

TL;DR

PromptCBLUE reformulates the CBLUE medical NLP benchmark into a large-scale, Chinese-language prompt-tuning testbed for evaluating open-source LLMs. It spans 18 tasks across five cohorts and is assessed under two tracks, Parameter-efficient Fine-tuning (PEFT) and In-Context Learning (ICL), using a unified prompt-response framework and a large pool of prompts. The study finds that 13B backbones with LoRA/QLoRA generally outperform 7B models under PEFT, while ICL remains challenging but benefits from improved demonstration retrieval and knapsack-based selection; data augmentation and chain-of-thought prompting also contribute to gains. Overall, PromptCBLUE provides a practical, privacy-friendly benchmark that informs the development of Chinese medical LLMs and highlights effective prompt-tuning and demonstration-selection strategies for real-world MedNLP deployment.

Abstract

This paper presents an overview of the PromptCBLUE shared task (http://cips-chip.org.cn/2023/eval1) held at the CHIP-2023 conference. This shared task reformulates the CBLUE benchmark and provides a testbed for Chinese open-domain or medical-domain large language models (LLMs) in general medical natural language processing. Two tracks are held: (a) a prompt tuning track, investigating the multitask prompt tuning of LLMs, and (b) an in-context learning track, probing the in-context learning capabilities of open-source LLMs. Many teams from both industry and academia participated in the shared task, and the top teams achieved strong test results. This paper describes the tasks, the datasets, the evaluation metrics, and the top systems for both tracks. Finally, the paper summarizes the techniques explored by the participating teams and the results of their evaluation.

Paper Structure (18 sections, 1 figure, 2 tables)

This paper contains 18 sections, 1 figure, 2 tables.

Introduction
Related Work
Medical natural language processing
Parameter-efficient Fine-tuning
In-context learning
Overview of PromptCBLUE
Overview
Prompt collection
Response format
Sample format
Dataset splits
Participating Teams and Methods
Participating Teams
Winning teams
Methods of the PEFT track
...and 3 more sections

Figures (1)

Figure 1: We introduce PromptCBLUE, a large-scale instruction tuning benchmark for Chinese medical LLMs, which converts different types of medical natural language processing tasks into a unified prompt-response generation task. PromptCBLUE consists of five cohorts of 18 tasks, which cover a variety of medical applications.
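To make the idea of a unified prompt-response task concrete, the following is a minimal sketch of how a structured sample might be turned into a (prompt, response) text pair. It is not the benchmark's actual conversion code: the template wording, the `PromptResponse` class, the function name, and the English example sentence are all assumptions made for illustration, and the real benchmark uses Chinese prompts drawn from a large prompt pool.

```python
from dataclasses import dataclass


@dataclass
class PromptResponse:
    """One sample in a unified text-to-text format (hypothetical structure)."""
    prompt: str
    response: str


# Hypothetical template; the benchmark's own prompts are not reproduced here.
NER_TEMPLATE = (
    "Find the medical entities of the following types in the sentence.\n"
    "Entity types: {types}\n"
    "Sentence: {text}\n"
    "Answer:"
)


def ner_to_prompt_response(text, entities, entity_types):
    """Convert an NER sample into a prompt and a plain-text target.

    `entities` is a list of (mention, type) pairs. The response lists the
    mentions grouped by type, one type per line, so that a generative model
    can be trained and scored on text output alone.
    """
    prompt = NER_TEMPLATE.format(types=", ".join(entity_types), text=text)
    lines = []
    for etype in entity_types:
        mentions = [m for m, t in entities if t == etype]
        if mentions:
            lines.append(f"{etype}: {', '.join(mentions)}")
    response = "\n".join(lines) if lines else "No entities found."
    return PromptResponse(prompt=prompt, response=response)


example = ner_to_prompt_response(
    text="The patient took aspirin for a headache.",
    entities=[("aspirin", "drug"), ("headache", "symptom")],
    entity_types=["drug", "symptom"],
)
print(example.prompt)
print(example.response)
```

Under this kind of scheme, classification, extraction, and generation tasks all share one input-output interface, so a single model can be tuned and evaluated across tasks without task-specific heads.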
