QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang

TL;DR

QoQ-Med introduces a generalist clinical multimodal foundation model that reasons across images, ECG time-series, and text. It relies on Domain-aware Relative Policy Optimization (DRPO) to balance learning across heterogeneous clinical domains, enabling robust performance and interpretable reasoning traces. Trained on 2.61 million QA pairs across 9 domains, QoQ-Med achieves large gains in macro-F1 across visual modalities and IoU-backed bounding-box reasoning, while demonstrating strong multimodal fusion on MIMIC-IV. The work provides open access to model weights, training pipelines, and reasoning traces to promote reproducibility and downstream clinical AI research.

Abstract

Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Training on 2.61 million instruction-tuning pairs spanning 9 clinical domains, we show that DRPO boosts diagnostic performance by 43% in macro-F1 on average across all visual domains compared with other critic-free training methods such as GRPO. Furthermore, having been trained on intensive segmentation data, QoQ-Med can highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
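To make the idea of scaling normalized rewards by domain rarity and difficulty concrete, here is a minimal Python sketch. It is not the authors' implementation: the weight formula (inverse square root of domain sample count times mean domain reward), the function names, and the `floor` parameter are all illustrative assumptions, and the clustering-based scaling stage mentioned in the paper is omitted. Rewards are assumed to lie in [0, 1].

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def domain_scaled_advantages(groups, floor=1e-2, eps=1e-6):
    """Scale group-normalized advantages by a per-domain weight.

    groups: list of (domain, rewards) pairs, one pair per prompt, where
            rewards holds the scores of the sampled answers (assumed in [0, 1]).
    The weight below is a hypothetical choice: domains with fewer samples
    (rarity) and a lower mean reward (difficulty) receive a larger weight.
    """
    by_domain = defaultdict(list)
    for domain, rewards in groups:
        by_domain[domain].extend(rewards)

    weights = {
        d: 1.0 / max((len(rs) * mean(rs)) ** 0.5, floor)
        for d, rs in by_domain.items()
    }

    return [
        (domain, [weights[domain] * a for a in group_normalize(rewards, eps)])
        for domain, rewards in groups
    ]


if __name__ == "__main__":
    batch = [
        ("xray", [1.0, 0.0, 1.0, 1.0]),
        ("xray", [1.0, 1.0, 0.0, 1.0]),
        ("ecg", [0.0, 1.0, 0.0, 0.0]),
    ]
    for domain, adv in domain_scaled_advantages(batch):
        print(domain, [round(a, 3) for a in adv])
```

In this toy batch the ECG domain is both rarer and harder than the X-ray domain, so its advantages are scaled up relative to plain group normalization, which is the qualitative behavior the abstract attributes to DRPO.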

Paper Structure

This paper contains 39 sections, 23 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: (a) Overview of QoQ-Med. The training corpus spans 11 clinical domains, including structured waveforms (e.g., ECG), diverse imaging modalities, electronic health records, and curated clinical QA pairs. Modality‑specific encoders convert inputs into token embeddings that are linearly projected into a common space and interleaved with text tokens before entering the LLM backbone. The model then autoregressively produces (i) an explainable chain‑of‑thought, (ii) bounding‑box annotations highlighting salient regions, and (iii) a concise clinical diagnosis. (b) Overview of DRPO Training. DRPO builds on top of the critic-free RL training method GRPO. The model's answer is first rated by a reward model before going through standard normalization. Then, a clustering-based scaling is performed on top of domain-wise scaling, both of which encourage the model to focus on scarce, hard examples across domains.
  • Figure 2: (a) Difference in accuracy (DRPO - GRPO). DRPO brings the most performance gain in understudied modalities as defined in App. \ref{sec:novel_definition}. (b) Accuracy comparison of QoQ-Med against SoTA open source and closed source LLMs. QoQ-Med outperforms all open and closed MLLMs across 8 domains. The full results are included in App. Table \ref{tab:model_comparison}.
  • Figure 3: (a) Accuracy of ECG Diagnosis. DRPO models reach the best performance among all critic-free RL methods. (b) Intersection over Union (IoU) of model-generated bounding boxes against truth labels. QoQ-Med (Ours) surpasses open source models and has a performance on par with o4-mini. (c) Per Step Runtime of reward calculation of RL methods on 8xA100 GPUs. While DRPO adds hierarchical clustering, the runtime of the reward calculation still accounts for less than 2% of the total runtime per step and has minimal impact on training.
  • Figure 4: Model outputs annotated by clinical experts. QoQ-Med correctly reasons from modality-specific clinical knowledge, generates bounding boxes, and outputs the correct predictions in most instances except (c). (e) demonstrates the model's ability to synthesize multimodal inputs with reasoning. The bounding boxes correctly highlight the salient regions related to the reasoning steps when one is present.
  • Figure 5: Comparison of DRPO and GRPO on Balanced Datasets. Acc: Accuracy, F1: F1 Score. To remove the influence of dataset imbalance, we further conducted an experiment on a balanced subset of the 30 datasets, where each dataset has the same share of the training mix. This lets us compare our method with similar methods such as loss scaling (e.g., focal loss [lin2017focal]) and upsampling/downsampling techniques. Thanks to dynamic weighting based on both difficulty and scarcity, our method better captures the changing dynamics throughout training, allowing it to outperform GRPO even on a perfectly balanced dataset.
  • ...and 8 more figures
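
For reference, the Intersection over Union (IoU) metric cited in Figure 3(b) can be computed for a pair of axis-aligned boxes as in the sketch below. The `(x1, y1, x2, y2)` corner format and the function name are assumptions made for illustration; the paper's evaluation code may use a different box convention or aggregate over multiple boxes.

```python
def box_iou(pred, truth):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 <= x2, y1 <= y2."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1.0 means the predicted box matches the ground-truth region exactly, and 0.0 means the two do not overlap at all.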