Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data

Ananya Ganapthy; Praveen Shastry; Naveen Kumarasami; Anandakumar D; Keerthana R; Mounigasri M; Varshinipriya M; Kishore Prasath Venkatesh; Bargava Subramanian; Kalyan Sivasailam

TL;DR

The study develops a Vision-Language Model (VLM) that fuses chest X-ray images with clinical notes to automate acute tuberculosis (TB) screening in resource-limited settings. The model combines a ViT-based visual encoder, a SIGLIP text encoder, cross-modal attention, and a Gemma-3b transformer decoder to generate context-aware diagnostic reports. It reports high precision and recall for key TB pathologies (consolidation, nodules, cavities/effusions), with AUC values of roughly 0.97–0.99 and strong IoU-based localization. The approach reduces reliance on radiologists and offers a scalable option for early TB detection across diverse healthcare settings. Future work focuses on subtle pathology detection, bias mitigation, and richer clinical data integration.
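For readers unfamiliar with the localization metric mentioned above, the following is a minimal sketch of intersection-over-union (IoU). It assumes axis-aligned bounding boxes given as `(x_min, y_min, x_max, y_max)`; the summary does not say whether the study scores boxes or segmentation masks, and the function name `box_iou` is hypothetical.

```python
from typing import Tuple

# Assumed box format (not specified in the source): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are scored as no overlap.
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the predicted and reference regions coincide exactly, and 0.0 means they do not overlap at all.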

Abstract

Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.
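As a reminder of how the precision and recall figures quoted in the Results are defined, here is a small sketch that computes both from confusion-matrix counts. The helper name `precision_recall` and the counts in the usage note are illustrative assumptions, not data from the study.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute (precision, recall) from true-positive, false-positive,
    and false-negative counts for a single pathology class."""
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

With toy counts of 8 true positives, 2 false positives, and 2 false negatives, `precision_recall(8, 2, 2)` gives a precision and a recall of 0.8 each. High precision limits false alarms, while high recall limits missed TB cases, which is usually the costlier error in screening.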
