Advancing High Resolution Vision-Language Models in Biomedicine

Zekai Chen; Arda Pekis; Kevin Brown

TL;DR

The paper adapts vision-language models to biomedicine by combining a high-resolution, multi-scale image perception pipeline with a biomedical instruction-tuning strategy. It presents Llama3-Med, built on a frozen vision encoder and a trainable connector that interfaces with Llama3, and trains it in two stages using synthetic data from Claude3-Opus and LLaMA3 70B. Key contributions include a new instruction dataset enriched with medical image-text pairs, a hierarchical image encoding approach across scales, and a model that achieves state-of-the-art zero-shot results on three biomedical VQA benchmarks, with an average improvement of over 10% compared to previous methods. The work aims to give clinicians more accurate and reliable multimodal assistants, while highlighting the continued need for diverse data, domain-specific pretraining, and careful consideration of biases and privacy in healthcare applications.

Abstract

Multi-modal learning has significantly advanced generative AI, especially in vision-language modeling. Innovations like GPT-4V and open-source projects such as LLaVA have enabled robust conversational agents capable of zero-shot task completions. However, applying these technologies in the biomedical field presents unique challenges. Recent initiatives like LLaVA-Med have started to adapt instruction-tuning for biomedical contexts using large datasets such as PMC-15M. Our research offers three key contributions: (i) we present a new instruct dataset enriched with medical image-text pairs from Claude3-Opus and LLaMA3 70B, (ii) we propose a novel image encoding strategy using hierarchical representations to improve fine-grained biomedical visual comprehension, and (iii) we develop the Llama3-Med model, which achieves state-of-the-art zero-shot performance on biomedical visual question answering benchmarks, with an average performance improvement of over 10% compared to previous methods. These advancements provide more accurate and reliable tools for medical professionals, bridging gaps in current multi-modal conversational assistants and promoting further innovations in medical AI.

Paper Structure (35 sections, 2 figures, 10 tables)

Introduction
Better Perception of Biomedical Images with Higher Resolutions
Overall Training Paradigm.
Rationale of Data Synthesis.
Experiments
Datasets for Evaluation and Benchmarking.
Evaluation Metrics.
Baselines.
Implementation details.
Results on Biomedical VQAs
Existing SoTA Methods.
Supervised Fine-tuning Results.
Zero-shot Results.
Discussions.
Quality of Generation
...and 20 more sections

Figures (2)

Figure 1: Illustration of building the feature embedding in Llama3-Med. High-resolution biomedical images are split into multiple smaller pieces that existing CLIP image encoders (Radford et al., 2021) can process. The embeddings of these hierarchical representations are then concatenated and fed into the connector for fine-tuning.
Figure 2: Illustration of the instruction fine-tuning paradigm. Similar to LLaVA-Med (Li et al., 2023), we freeze the image encoder while fine-tuning the connector and the LLM base.
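
The tiling idea described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tile_boxes`, `hierarchical_boxes`, `encode_hierarchy`), the square resize per scale, the tile size, and the stand-in `encode_tile` callable are all hypothetical. In practice the encoder would be a frozen CLIP image encoder and the concatenated features would be passed to the connector.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Cover a width x height image with tile x tile crops, row by row.

    Assumes width >= tile and height >= tile. When the image size is not a
    multiple of the tile size, neighbouring tiles overlap so that every crop
    stays inside the image.
    """
    def starts(extent: int) -> List[int]:
        count = -(-extent // tile)  # ceiling division: tiles needed on this axis
        if count == 1:
            return [0]
        step = (extent - tile) / (count - 1)
        return [round(i * step) for i in range(count)]

    return [(x, y, x + tile, y + tile)
            for y in starts(height) for x in starts(width)]


def hierarchical_boxes(base: int, scales: Sequence[int],
                       tile: int) -> List[Tuple[int, Box]]:
    """List (scale, crop box) pairs for every level of the hierarchy.

    Assumption: at scale s the image is resized to a (base * s) square and
    then tiled, so scale 1 is a single global view when base == tile.
    """
    levels: List[Tuple[int, Box]] = []
    for scale in scales:
        side = base * scale
        levels.extend((scale, box) for box in tile_boxes(side, side, tile))
    return levels


def encode_hierarchy(boxes: Sequence[Tuple[int, Box]],
                     encode_tile: Callable[[int, Box], List[float]]) -> List[float]:
    """Concatenate per-tile embeddings in hierarchy order.

    `encode_tile` is a hypothetical stand-in for a frozen image encoder
    applied to one crop.
    """
    features: List[float] = []
    for scale, box in boxes:
        features.extend(encode_tile(scale, box))
    return features
```

For example, `hierarchical_boxes(336, [1, 2], 336)` yields one global view plus four crops of the 672-pixel version, so the encoder sees five inputs of identical size. The value 336 is only an assumed tile size for illustration.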
