LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies

Ameer Hamza; Abdullah; Yong Hyun Ahn; Sungyoung Lee; Seong Tae Kim

TL;DR

This work tackles the challenge of generating accurate and clinically informative natural language explanations (NLEs) for thoracic pathology predictions by integrating a knowledge-graph-based retrieval-augmented generation (KG-RAG) module with vision-language models. It introduces three instantiations, KG-LLaVA, Med-XPT, and Bio-LLaVA, each of which fuses domain knowledge retrieved from a KG with visual features to produce high-quality NLEs on the MIMIC-NLE dataset. The KG-RAG approach delivers state-of-the-art results across standard NLG metrics and diagnostics, and it addresses privacy by retrieving de-identified KG triplets in latent space rather than storing patient images. The findings indicate that domain-specific KG augmentation improves the factual correctness and interpretability of radiology explanations, supporting more trustworthy AI-assisted diagnostics in clinical workflows.

Abstract

Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models' insufficient domain-specific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model's understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.

Paper Structure (33 sections, 9 equations, 2 figures, 11 tables)

Introduction
Related Work
Natural Language Explanation.
Vision-Language Models.
Knowledge Graph.
Retrieval Augmented Generation.
Methodology
Pathology Classification
Knowledge Graph Retrieval
Experiment
Dataset
Implementation Details
Training
Evaluation Metrics
Results and Discussions
...and 18 more sections

Figures (2)

Figure 1: Overview of the KG-LLaVA framework with integrated Knowledge Graph Retrieval Augmented Generation (KG-RAG) module. The framework combines a pre-trained LLaVA model with a CLIP ViT-L vision encoder to extract visual features, which are then projected into the language model's embedding space. The KGR module uses MedCLIP to map input images to a shared latent space and retrieve relevant KG triplets via the FAISS library. These triplets provide domain-specific context that enhances the generation of accurate and informative NLEs for thoracic pathologies. The modular design allows for seamless integration with other architectures, such as Med-XPT and Bio-LLaVA, ensuring flexibility and adaptability across different vision-language tasks.
Figure 2: Comparison of NLEs generated by different models—KG-LLaVA, Med-XPT, and Bio-LLaVA—against the ground truth (GT) for a specific thoracic pathology case. The image depicts a chest X-ray used as input, with the corresponding NLEs. KG-LLaVA accurately matches the GT by identifying the underlying abnormalities, while Bio-LLaVA and Med-XPT offer alternative interpretations, reflecting the models' varying strengths and limitations in clinical reasoning.
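
The retrieval step described in the Figure 1 caption (embed the image, look up the nearest KG triplets in a shared latent space, pass them to the language model as context) can be sketched as follows. This is a minimal illustration and not the authors' implementation. A brute-force cosine-similarity search stands in for the FAISS index, the three-dimensional vectors stand in for MedCLIP embeddings, and the triplets, the function names `retrieve_triplets` and `build_prompt`, and the prompt wording are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve_triplets(image_embedding, triplet_embeddings, triplets, k=2):
    """Return the k triplets whose embeddings are most similar to the image.

    Assumption: image and triplet embeddings already live in one shared
    latent space (the caption attributes this mapping to MedCLIP).
    Brute-force search is used here in place of a FAISS index.
    """
    query = l2_normalize(image_embedding)
    scored = []
    for emb, triplet in zip(triplet_embeddings, triplets):
        key = l2_normalize(emb)
        score = sum(q * v for q, v in zip(query, key))
        scored.append((score, triplet))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triplet for _, triplet in scored[:k]]


def build_prompt(retrieved):
    """Join retrieved triplets into a text context for the language model.

    The prompt template is hypothetical.
    """
    facts = "; ".join(" ".join(t) for t in retrieved)
    return f"Context: {facts}. Explain the finding."


# Toy datastore: made-up triplets and made-up 3-d embeddings.
# Only triplets are stored, no patient images.
TRIPLETS = [
    ("opacity", "suggestive_of", "pneumonia"),
    ("effusion", "located_at", "left lung base"),
    ("cardiac silhouette", "is", "enlarged"),
]
TRIPLET_EMBEDDINGS = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]

if __name__ == "__main__":
    image_embedding = [1.0, 0.2, 0.0]  # stand-in for an encoded chest X-ray
    top = retrieve_triplets(image_embedding, TRIPLET_EMBEDDINGS, TRIPLETS, k=2)
    print(build_prompt(top))
```

Because the datastore holds only triplets and their embeddings, a lookup never returns another patient's image or report, which is the privacy property the abstract describes. At realistic datastore sizes the brute-force loop would be replaced by an approximate or exact nearest-neighbor index such as the FAISS one named in the caption.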
