MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang

TL;DR

MedXIAOHE is a medical vision-language foundation model designed for real-world clinical reasoning across text, images, OCR, and long-form reports. It combines an entity-aware continual pretraining framework built on the Medical Entity Tree (MET), dense knowledge-centric data synthesis, and tool-augmented agentic training to enable multi-step diagnostic reasoning with verifiable traces. A unified evaluation framework, the Unified Med-VLM Benchmark, standardizes prompting, scoring, and decontamination across 30+ public and in-house benchmarks, emphasizing reliability, faithfulness, and deployment relevance. The authors report strong multi-domain performance, improved coverage of long-tail medical concepts, and robust reasoning and grounding capabilities, with the aim of bridging benchmark success and clinical usability and safety.

Abstract

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

Paper Structure (72 sections, 1 equation, 14 figures, 8 tables)

Figures (14)

  • Figure 1: Performance comparison of MedXIAOHE against SOTA models on comprehensive medical benchmarks. The left panel shows the overall average score across 30+ benchmarks, demonstrating the strong performance of MedXIAOHE. The right panels detail the comparative results across six key capabilities. In the upper-left bar chart, dark bars represent scores on public benchmarks, and light bars represent scores on in-house benchmarks. We did not evaluate Gemini 3.0 Pro on the in-house benchmarks because of changes in its privacy protocols.
  • Figure 2: The architecture of MedXIAOHE. The model utilizes a Multimodal Native-Resolution Transformer to process diverse medical imaging modalities (e.g., X-ray, CT, Pathology) with varying resolutions and aspect ratios. Visual features encoded by Seed-ViT are projected via an MLP Adapter and interleaved with text tokens, integrating medical knowledge and patient records to support multi-turn dialogue and reasoning-based generation.
  • Figure 3: Data-cleaning pipeline for the continual pretraining corpus. The pipeline comprises two main stages: a text-cleaning workflow and a multimodal data production workflow. To construct a high-quality pretraining corpus, we apply a combination of hash-based deduplication, rule-based filtering, and model-based quality control.
  • Figure 4: Architecture overview of the Medical Entity Tree. This hierarchical taxonomy organizes medical concepts into aligned categories to facilitate balanced entity training, precise knowledge coverage quantification, and entity-driven data acquisition.
  • Figure 5: Mid-training data construction overview. The framework illustrates the comprehensive pipeline designed to synthesize high-fidelity medical reasoning data from diverse sources. (a) The data synthesis engine aggregates unsupervised and supervised corpora, utilizing knowledge graphs and multi-agent consensus to construct structured reasoning datasets. (b) A multi-expert rejection sampling mechanism with dual quality gates is employed to distill diverse and causally valid reasoning trajectories. (c) The process incorporates a structured Chain-of-Thought construction pipeline with automatic quality checks, strictly aligning visual perception with logical deduction to eliminate hallucinations.
  • ...and 9 more figures
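
The Figure 3 caption names three cleaning steps: hash-based deduplication, rule-based filtering, and model-based quality control. The Python sketch below shows one way such steps could be chained for text documents. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the SHA-256 exact-match hashing, the length and symbol-ratio rules, and all thresholds are hypothetical, and the quality model is left as a caller-supplied scoring function.

```python
import hashlib
import re
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """Exact-match fingerprint of the normalized text (assumed hashing scheme)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def passes_rules(text: str, min_chars: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical rule filter: drop very short or symbol-heavy documents."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    symbols = sum(1 for ch in stripped if not (ch.isalnum() or ch.isspace()))
    return symbols / len(stripped) <= max_symbol_ratio


def clean_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],
    min_quality: float = 0.5,
) -> List[str]:
    """Apply deduplication, rule filtering, then model-based scoring, in that order."""
    seen = set()
    kept = []
    for doc in docs:
        fingerprint = content_hash(doc)
        if fingerprint in seen:
            continue  # duplicate of a document already processed
        seen.add(fingerprint)
        if not passes_rules(doc):
            continue  # rejected by the cheap rule-based filter
        if quality_score(doc) < min_quality:
            continue  # rejected by the (caller-supplied) quality model
        kept.append(doc)
    return kept
```

Running the cheap hash and rule checks before the quality scorer keeps the expensive model call off documents that would be discarded anyway. Any callable that maps a document to a score can stand in for `quality_score`.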