GROK: From Quantitative Biomarkers to Qualitative Diagnosis via a Grounded MLLM with Knowledge-Guided Instruction

Zhuangzhi Gao, Hongyi Qin, He Zhao, Qinkai Yu, Feixiang Zhou, Eduard Shantsila, Uazman Alam, Alena Shantsila, Wahbi El-Bouri, Gregory Y. H. Lip, Yalin Zheng

TL;DR

GROK tackles the challenge of making ophthalmic diagnosis from CFP and OCT both interpretable and clinically grounded. It introduces a three-stage pipeline (Knowledge-Guided Instruction Generation, OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning), coupled with dual encoders and a cross-modal fusion module, to translate quantitative biomarkers into qualitative diagnostic reasoning. On the Grounded Ophthalmic Understanding benchmark, GROK outperforms open-source 7B and 32B baselines across macro-diagnostic metrics, report quality, and fine-grained clinical assessments, while matching or approaching proprietary models in several dimensions. The results indicate that domain-aligned encoders and knowledge-guided data generation improve interpretability and diagnostic fidelity. Future work aims to expand the datasets, add modalities, and pursue larger-scale pretraining and alignment.

Abstract

Multimodal large language models (MLLMs) hold promise for integrating diverse data modalities, but current medical adaptations such as LLaVA-Med often fail to fully exploit the synergy between color fundus photography (CFP) and optical coherence tomography (OCT), and offer limited interpretability of quantitative biomarkers. We introduce GROK, a grounded multimodal large language model that jointly processes CFP, OCT, and text to deliver clinician-grade diagnoses of ocular and systemic disease. GROK comprises three core modules: Knowledge-Guided Instruction Generation, CLIP-Style OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning, which together establish a quantitative-to-qualitative diagnostic chain of thought, mirroring real clinical reasoning when producing detailed lesion annotations. To evaluate our approach, we introduce the Grounded Ophthalmic Understanding benchmark, which covers six disease categories and three tasks: macro-level diagnostic classification, report generation quality, and fine-grained clinical assessment of the generated chain of thought. Experiments show that, with only LoRA (Low-Rank Adaptation) fine-tuning of a 7B-parameter Qwen2 backbone, GROK outperforms comparable 7B and 32B baselines on both report quality and fine-grained clinical metrics, and even exceeds OpenAI o3. Code and data are publicly available in the GROK repository.

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Motivating example. On paired CFP & OCT images, existing MLLMs (Lingshu-32B, Qwen2.5VL-32B, OpenAI-o3; left) each fail one of three criteria—quantitative biomarker analysis, qualitative diagnosis (sub-inference), or alignment with the final conclusion (✗). Our pipeline (right) meets all three (✓), converting measurements into coherent clinical reasoning.
  • Figure 2: Illustration of GROK's model architecture. It consists of two CLIP-style vision encoders (one for CFP, one for OCT), followed by projection layers that align the visual features with the text embedding space before they are passed to the LLM. The model's effectiveness relies on a three-stage training pipeline: Knowledge-Guided Instruction Generation, OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning.
  • Figure 3: Knowledge-Guided Instruction Generation: Eye-Guideline prompts and OpenAI-o3 convert CFP/OCT biomarkers into a grounded chain-of-thought report. CLIP-Style OCT-biomarker alignment: A contrastive loss aligns 2-D central-foveal OCT slices with their 3-D biomarker vectors, yielding clinically grounded OCT features. Cross-Modal Fusion: Modality-specific projectors embed CFP and OCT features into a shared language space, enabling Qwen2 to generate the final diagnosis.
  • Figure 4: Comparison of diagnostic reports generated by GROK and OpenAI-o3 on retinal fundus images from a diabetic retinopathy case.
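
The CLIP-style OCT-biomarker alignment described in the Figure 3 caption can be illustrated with a minimal sketch of a symmetric contrastive (InfoNCE) loss. This is not the authors' implementation: the function names, the temperature value of 0.07, and the assumption that OCT features and biomarker vectors have already been projected to a shared embedding dimension are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(oct_emb, bio_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    oct_emb[i] and bio_emb[i] are assumed to come from the same eye
    (a positive pair); every other combination in the batch is a negative.
    The temperature is an assumed default, not a value from the paper.
    """
    if not oct_emb or len(oct_emb) != len(bio_emb):
        raise ValueError("need two non-empty batches of equal size")
    o = [l2_normalize(v) for v in oct_emb]
    b = [l2_normalize(v) for v in bio_emb]
    n = len(o)
    # Cosine similarity between every OCT and biomarker embedding, scaled.
    logits = [
        [sum(x * y for x, y in zip(o[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # OCT -> biomarker direction: the matching pair sits on the diagonal.
    loss_o2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Biomarker -> OCT direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_b2o = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_o2b + loss_b2o)


if __name__ == "__main__":
    aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")
```

In this toy setup, correctly paired embeddings yield a loss near zero while swapped pairs are heavily penalized, which is the property that pulls an OCT slice's features toward its own biomarker vector during training.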