Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

TL;DR

The paper demonstrates a unified transformer model for multi-modal clinical tasks, trained with customized instruction tuning. It unifies vision-intensive tasks in a single training framework with homogeneous model inputs and outputs, increasing clinical interpretability in one reading.

Abstract

The emergence of multi-modal deep learning models has had a significant impact on clinical applications in the last decade. However, the majority of models are limited to a single task, without considering that disease diagnosis is in fact a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction tuning. We first compose a multi-task training dataset comprising 13.4 million instruction and ground-truth pairs (with approximately one million radiographs) for the customized tuning, involving both image- and pixel-level tasks. We can thus unify the various vision-intensive tasks in a single training framework with homogeneous model inputs and outputs to increase clinical interpretability in one reading. Finally, we demonstrate the overall superior performance of our model compared to prior art on various chest X-ray benchmarks across multiple tasks in both direct inference and finetuning settings. Three radiologists further evaluate the generated reports against the recorded ones, and their evaluation also indicates the enhanced explainability of our multi-task model.

Paper Structure (29 sections, 4 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: (a) Overview of the proposed method OmniFM-DR and (b) the pretraining dataset Omni-VQA. The attribute classification extracts disease phrases and related attributes (severity level and location) from the report. (c) Typical example of the instruction set, composed of multiple downstream tasks in chest X-rays (i.e., disease/attribute classification, localization, segmentation, and report generation). Omni-VQA is used to pretrain the various tasks.
  • Figure 1: Example outputs of OmniFM-DR and the ground truth in four tasks (multi-disease classification, visual grounding, report generation, and segmentation) for four labels: (a) Pneumothorax; (b) Pneumonia; (c) Edema; (d) Atelectasis. In the chest X-ray image on the left, the solid green bounding box is the ground truth, and the red-white dashed bounding box is the region detected by OmniFM-DR. In the report comparison on the right, the blue highlighted text marks the part of the report generated by OmniFM-DR that describes the classified lesions, compared to the ground truth report, and the yellow highlighted area marks the matched report text describing other categories. CTR and PCR denote the Cardiothoracic Ratio and Pneumothorax Compress Ratio, respectively, and are described in detail in the Result section.
  • Figure 2: (a) Comparison of the entity classification task between OmniFM-DR and other classification models (ConVIRT, GloRIA, BioViL) on the ChestXray14 dataset. (b) Evaluation of the attribute classification task at the disease severity and location levels. ACC and F1 scores are used to assess the classification task, and "mean" is the weighted average over all attributes according to their frequency of occurrence.
  • Figure 2: Typical examples of the instruction set for six disease labels: Pneumothorax, Atelectasis, Pneumonia, Pleural Effusion, Consolidation, and Opacity. The left panel shows the multiple instruction sets used during the training and testing phases. In the chest X-ray image, the red dashed bounding box denotes the region detected by OmniFM-DR.
  • Figure 3: (a) Typical cases with bounding boxes and (b) comparisons between OmniFM-DR and other disease localization models. We assess the disease localization task with the ACC and mIoU metrics on the MS-CXR and ChestXray14 datasets. The two datasets share five common diseases: Cardiomegaly, Effusion, Pneumothorax, Atelectasis, and Pneumonia. The MS-CXR dataset includes three additional diseases (Consolidation, Edema, and Opacity), while the ChestXray14 dataset includes three additional diseases (Infiltrate, Mass, and Nodule).
  • ...and 5 more figures
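
The captions above name two aggregate metrics without spelling them out: mIoU for bounding-box localization (Figure 3) and a frequency-weighted "mean" over attributes (Figure 2). The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names, the `(x_min, y_min, x_max, y_max)` box format, and the IoU threshold of 0.5 used for localization accuracy are all assumptions made for illustration; the paper may define ACC differently.

```python
from typing import Sequence, Tuple

# Assumed box format (hypothetical, not taken from the paper):
# (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """mIoU: average IoU over paired predicted and ground-truth boxes."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


def localization_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU reaches the threshold.

    The 0.5 threshold is an assumption, not a value stated in the source.
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


def frequency_weighted_mean(scores: Sequence[float], counts: Sequence[int]) -> float:
    """Average of per-attribute scores, weighted by how often each attribute occurs."""
    total = sum(counts)
    if len(scores) != len(counts) or total <= 0:
        raise ValueError("scores and counts must align and counts must sum > 0")
    return sum(s * c for s, c in zip(scores, counts)) / total
```

For example, `mean_iou([(0, 0, 2, 2)], [(1, 1, 3, 3)])` evaluates an overlap of area 1 against a union of area 7. Weighting by frequency means that common attributes dominate the reported "mean", so a model that does poorly on rare attributes can still score well on it.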