LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task

Khai Le-Duc, Ryan Zhang, Ngoc Son Nguyen, Tan-Hanh Pham, Anh Dao, Ba Hung Ngo, Anh Totti Nguyen, Truong-Son Hy

TL;DR

This work introduces LiteGPT, a unified vision-language framework for joint localization and classification in chest X-rays, addressing a gap where medical VLMs typically target single tasks. It employs multiple frozen visual encoders (BiomedCLIP and PubMedCLIP) fused into a language model (Llama 2-Chat) via a two-stage training regimen that first localizes findings and then diagnoses diseases. The approach yields state-of-the-art classification performance on VinDr-CXR, provides baselines for medical image localization with vision-language models, and demonstrates improved localization and text validity through multi-encoder fusion and careful token design. The framework offers a scalable, modality-rich tool for radiology that can assist clinicians in accurate diagnosis and reporting, with publicly available code and models for broader adoption.

Abstract

Vision-language models have been extensively explored across a wide range of tasks, achieving satisfactory performance; however, their application in medical imaging remains underexplored. In this work, we propose LiteGPT, a unified framework for medical imaging. We leverage multiple pre-trained visual encoders to enrich the visual information and enhance the performance of vision-language models. To the best of our knowledge, this is the first study to use vision-language models for the novel task of joint localization and classification in medical images. We are also the first to provide baselines for disease localization in chest X-rays. Finally, we set a new state of the art in image classification on the well-benchmarked VinDr-CXR dataset. All code and models are publicly available online: https://github.com/leduckhai/LiteGPT

Paper Structure (32 sections, 5 equations, 3 figures, 13 tables)

Figures (3)

  • Figure 1: Overview of our proposed method. The model employs multiple visual encoders as its visual backbone, specifically incorporating two different encoders, which remain frozen throughout all training phases. We concatenate five adjacent visual tokens from the output of the visual backbone and project them into the language space of Llama 2. The text is embedded through the embedding layer of Llama 2 and directly concatenated with the visual features to generate the answer.
  • Figure 2: The distribution for a total of 28 findings and diagnoses. The numbers of positive labels were reported based on the assessments of the participating radiologists.
  • Figure 3: Visualization of the bounding boxes for the examples shown in Table 7. The red rectangles indicate the model's predicted boxes, while the green rectangles indicate the ground truth boxes. In the left image (example 1), the red box at the top represents Aortic enlargement, the green box at the top represents Calcification, and the boxes at the bottom represent Cardiomegaly. In the right image (example 2), the red and green boxes at the top represent Aortic enlargement, and the boxes at the bottom represent Cardiomegaly.
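
The token pipeline described in the Figure 1 caption (two frozen encoders, five adjacent visual tokens concatenated, a projection into the language space, then concatenation with text embeddings) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The feature sizes, the token count, the feature-axis fusion of the two encoders, the random stand-in features, and the visual-before-text ordering are all assumptions made for the example, and the function names `fuse_and_group` and `project` are hypothetical.

```python
import numpy as np


def fuse_and_group(feats_a, feats_b, group=5):
    """Fuse two encoders' token features, then merge `group` adjacent tokens.

    Assumption: both encoders emit the same number of tokens and are fused
    by concatenating along the feature axis.
    """
    fused = np.concatenate([feats_a, feats_b], axis=-1)  # (n_tokens, d_a + d_b)
    n_tokens, dim = fused.shape
    usable = (n_tokens // group) * group  # drop trailing tokens that do not fill a group
    # Row-major reshape places each run of `group` consecutive tokens in one row.
    return fused[:usable].reshape(usable // group, group * dim)


def project(grouped, weight, bias):
    """Linear projection of grouped visual tokens into the language model's space."""
    return grouped @ weight + bias


rng = np.random.default_rng(0)

# Hypothetical sizes chosen only to keep the example small.
n_tokens, d_a, d_b, d_llm = 195, 16, 16, 32

# Random stand-ins for the outputs of the two frozen visual encoders.
feats_a = rng.standard_normal((n_tokens, d_a))
feats_b = rng.standard_normal((n_tokens, d_b))

grouped = fuse_and_group(feats_a, feats_b, group=5)

# The projection layer is the trainable bridge between vision and language.
weight = rng.standard_normal((grouped.shape[1], d_llm)) * 0.02
bias = np.zeros(d_llm)
visual_embeds = project(grouped, weight, bias)

# Stand-in for text embeddings from the language model's embedding layer.
text_embeds = rng.standard_normal((12, d_llm))

# Assumption: visual tokens are placed before the text tokens.
llm_input = np.concatenate([visual_embeds, text_embeds], axis=0)
print(visual_embeds.shape, llm_input.shape)
```

Grouping five tokens per row shortens the visual sequence fivefold before it reaches the language model, at the cost of a wider input to the projection layer.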