Overview of the PromptCBLUE Shared Task in CHIP2023
Wei Zhu, Xiaoling Wang, Mosha Chen, Buzhou Tang
TL;DR
PromptCBLUE reformulates the CBLUE medical NLP benchmark into a large-scale, Chinese-language prompt-tuning testbed to evaluate open-source LLMs. It spans 18 tasks across five cohorts and is assessed under two tracks, Parameter-efficient Fine-tuning (PEFT) and In-Context Learning (ICL), using a unified prompt-response framework and a large pool of prompts. The study finds that 13B backbones with LoRA/QLoRA generally outperform 7B models under PEFT, while ICL remains challenging but benefits from improved demonstration retrieval and knapsack-based selection; data augmentation and chain-of-thought prompting also contribute to gains. Overall, PromptCBLUE provides a practical, privacy-friendly benchmark that informs development of Chinese medical LLMs and highlights effective prompt-tuning and demonstration-selection strategies for real-world MedNLP deployment.
Abstract
This paper presents an overview of the PromptCBLUE shared task (http://cips-chip.org.cn/2023/eval1) held at the CHIP-2023 Conference. The shared task reformulates the CBLUE benchmark and provides a testbed for evaluating Chinese open-domain or medical-domain large language models (LLMs) on general medical natural language processing. Two tracks were held: (a) a prompt tuning track, investigating the multitask prompt tuning of LLMs, and (b) an in-context learning track, probing the in-context learning capabilities of open-source LLMs. Many teams from both industry and academia participated in the shared task, and the top teams achieved strong test results. This paper describes the tasks, the datasets, the evaluation metrics, and the top systems for both tracks. Finally, it summarizes the techniques explored by the participating teams and the evaluation results of their approaches.
