PaliGemma-CXR: A Multi-task Multimodal Model for TB Chest X-ray Interpretation

Denis Musinguzi, Andrew Katumba, Sudi Murindanyi

TL;DR

PaliGemma-CXR introduces a unified multi-task multimodal model for TB chest X-ray interpretation that jointly handles diagnosis, detection, segmentation, radiology report generation, and visual question answering. Built on a SigLIP-Gemma backbone, it leverages a curated Ugandan TB X-ray dataset and derived multimodal task data via prompts and VQ-VAE–encoded segmentation, with inverse-dataset-size sampling to mitigate task imbalance. Across five tasks, the model outperforms task-specific and zero-shot baselines, demonstrating strong cross-task transfer and practical prompt-based deployment that reduces the need for bounding boxes in clinical settings. This work advances data-efficient, integrated vision-language models in medical imaging and offers a path toward scalable, low-resource TB screening tools.

Abstract

Tuberculosis (TB) is an infectious disease and a global health challenge. Chest X-rays are a standard method for TB screening, yet many countries face a critical shortage of radiologists capable of interpreting these images. Machine learning offers an alternative, as it can automate tasks such as disease diagnosis and report generation. However, traditional approaches rely on task-specific models, which cannot exploit the interdependence between tasks. Building a single model that performs multiple tasks poses additional challenges, such as scarcity of multimodal data, dataset imbalance, and negative transfer. To address these challenges, we propose PaliGemma-CXR, a multi-task multimodal model capable of performing TB diagnosis, object detection, segmentation, report generation, and visual question answering (VQA). Starting with a dataset of chest X-ray images annotated with TB diagnosis labels and segmentation masks, we curated a multimodal dataset to support the additional tasks. By fine-tuning PaliGemma on this dataset and sampling data across tasks in ratios inversely proportional to the size of each task dataset, we achieved the following results: 90.32% accuracy on TB diagnosis, 98.95% accuracy on closed-ended VQA, a BLEU score of 41.3 on report generation, and a mAP of 19.4 and 16.0 on object detection and segmentation, respectively. These results demonstrate that PaliGemma-CXR effectively leverages the interdependence between multiple image interpretation tasks to enhance performance.
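
The sampling scheme mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes that each training example's task is drawn with probability proportional to the inverse of that task's dataset size, which is one plausible reading of the abstract rather than the authors' confirmed implementation. The task sizes, the function names `inverse_size_weights` and `sample_task`, and the fixed seed are all hypothetical and chosen only for illustration.

```python
import random


def inverse_size_weights(task_sizes):
    """Return per-task sampling probabilities proportional to 1 / dataset size."""
    if any(n <= 0 for n in task_sizes.values()):
        raise ValueError("every task dataset must contain at least one example")
    inverse = {task: 1.0 / n for task, n in task_sizes.items()}
    total = sum(inverse.values())
    # Normalize so the probabilities sum to 1.
    return {task: w / total for task, w in inverse.items()}


def sample_task(weights, rng):
    """Draw one task name according to the given probabilities."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


# Hypothetical dataset sizes (NOT taken from the paper).
task_sizes = {
    "diagnosis": 4000,
    "detection": 1000,
    "segmentation": 1000,
    "report_generation": 500,
    "vqa": 8000,
}

weights = inverse_size_weights(task_sizes)
rng = random.Random(0)  # fixed seed so the draw order is reproducible
batch_tasks = [sample_task(weights, rng) for _ in range(8)]
```

Under this scheme the smallest task dataset receives the largest sampling probability, so small tasks are visited more often per example than large ones. With the hypothetical sizes above, `report_generation` is drawn 16 times as often as `vqa`.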

Paper Structure

This paper contains 28 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Images from the training dataset. (a) shows an image with active TB, (b) shows a TB negative image, (c) shows an image with latent TB, (d) shows an image with bounding box, and (e) shows an image with segmentation masks.
  • Figure 2: PaliGemma-CXR architecture: a SigLIP image encoder feeds into the Gemma decoder LM.
  • Figure 3: Medical report generated by PaliGemma-CXR. The model identifies consolidation in the right lung field.
  • Figure 4: (a) shows the bounding boxes predicted by PaliGemma-CXR and (b) shows the ground truth bounding boxes, (c) shows the ground truth segmentation mask and (d) shows the segmentation mask generated by PaliGemma-CXR.