ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development

Ta Duc Huy, Nguyen Anh Tu, Tran Hoang Vu, Nguyen Phuc Minh, Nguyen Phan, Trung H. Bui, Steven Q. H. Truong

TL;DR

ViMQ introduces a Vietnamese medical question dataset with joint NER and IC annotations to advance healthcare dialogue systems. It presents Hierarchical Supervisors Seeding for annotation and a span-noise based online self-supervised training strategy that yields consistent gains on ViMQ and a COVID-19 Vietnamese NER benchmark. Baseline PhoBERT-based IC and NER models (with BIO tagging and CRF) are used to establish a solid NLU foundation, and the proposed methods enhance robustness to annotation noise. The dataset enables development of a practical NLU module for healthcare chatbots, capable of deconstructing user questions into actionable entities and intents to retrieve medical answers or route to clinicians, with code and data to be published.
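To make the BIO tagging mentioned above concrete, here is a minimal sketch of how per-token BIO tags can be collapsed into entity spans and paired with an intent label. The function name `bio_to_spans`, the tag names `SYMPTOM` and `MEDICINE`, the intent `treatment`, and the English example sentence are illustrative assumptions, not the dataset's actual tag set or the authors' code.

```python
from typing import List, Tuple

Span = Tuple[str, int, int, str]  # (label, start, end_exclusive, text)


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Collapse BIO tags into entity spans (hypothetical helper)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Span] = []
    start, label = None, None
    # A trailing sentinel "O" flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray "I-" tag with no open span of the same label is ignored.
    return spans


# Illustrative input only: tag and intent names are assumptions.
tokens = ["I", "have", "a", "sore", "throat", "should", "I", "take", "paracetamol"]
tags = ["O", "O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "O", "O", "B-MEDICINE"]

parsed = {"intent": "treatment", "entities": bio_to_spans(tokens, tags)}
print(parsed)
```

A downstream dialogue component could use such a structure to look up an answer or route the question to a clinician; how that routing is done is not specified here.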

Abstract

Existing medical text datasets usually take the form of question-and-answer pairs that support natural language generation, but they lack composite annotations of the medical terms. In this study, we publish a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. The tag sets for the two tasks are in the medical domain and can facilitate the development of task-oriented healthcare chatbots with better comprehension of patients' queries. We train baseline models for the two tasks and propose a simple self-supervised training strategy with span-noise modelling that substantially improves performance. Dataset and code will be published at https://github.com/tadeephuy/ViMQ

Paper Structure

This paper contains 17 sections, 1 equation, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Examples from ViMQ dataset. Each block includes the IC tag (right side) and the NER tags of an example from the dataset (above the horizontal line) and its English translation (below the horizontal line).