CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao, Bin Jing, Quan Wang

TL;DR

Medical Vision-Language Models suffer from hallucinations, undermining clinical reliability. CheXPO-v2 introduces a verifiable reinforcement learning framework that replaces outcome-only rewards with fine-grained process supervision via a Knowledge Graph Consistency Reward and Entity-Relation Matching, paired with hard example mining. The approach leverages a large-scale chest X-ray instructional dataset and a two-stage warm-up plus GRPO training to achieve state-of-the-art results on MIMIC-CXR-VQA and Medical-Diff-VQA with only 5k preference samples. This yields clinically sound, verifiable reasoning with high data efficiency, advancing safe deployment of radiology VLMs; code is publicly available.

Abstract

Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
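The reward idea in the abstract can be illustrated with a minimal sketch. It assumes the reasoning has already been parsed into (disease, relation, anatomy) triplets and scores their overlap with reference triplets using Jaccard similarity, the matching variant named in the Figure 4 caption. The function names, the `process_weight` mixing coefficient, the score of 1.0 when both triplet sets are empty, and the use of a population standard deviation in the group normalization are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import statistics
from typing import Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (disease, relation, anatomy)


def normalize(triplets: Iterable[Triplet]) -> Set[Triplet]:
    """Lower-case and strip each field so trivial formatting differences do not count as mismatches."""
    return {tuple(part.strip().lower() for part in t) for t in triplets}


def triplet_jaccard(predicted: Iterable[Triplet], reference: Iterable[Triplet]) -> float:
    """Jaccard similarity |P & R| / |P | R| between predicted and reference triplet sets."""
    pred, ref = normalize(predicted), normalize(reference)
    if not pred and not ref:
        return 1.0  # assumption: nothing to state and nothing stated counts as consistent
    return len(pred & ref) / len(pred | ref)


def composite_reward(answer_correct: bool,
                     predicted: Iterable[Triplet],
                     reference: Iterable[Triplet],
                     process_weight: float = 0.5) -> float:
    """Mix an outcome reward (answer correctness) with a process reward (triplet overlap).

    process_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    outcome = 1.0 if answer_correct else 0.0
    process = triplet_jaccard(predicted, reference)
    return (1.0 - process_weight) * outcome + process_weight * process


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    This is the usual group-relative normalization used by GRPO; using the
    population standard deviation here is an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


reference = [("Pneumonia", "located_at", "left lower lobe"),
             ("effusion", "located_at", "right pleural space")]
predicted = [("pneumonia", "located_at", "Left Lower Lobe"),
             ("osteophytes", "located_at", "spine")]  # unsupported triplet lowers the score

print(triplet_jaccard(predicted, reference))
print(composite_reward(True, predicted, reference))
```

Because the union in the denominator grows with every extra triplet, a response that pads its reasoning with unverifiable findings is scored lower even when its final answer is correct, which is the behavior that process supervision is meant to encourage.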

Paper Structure

This paper contains 29 sections, 13 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A medical vision-language model's generated response can contain factual errors (e.g., misjudging the angle as "sharply defined") and hard-to-verify, irrelevant information (e.g., "osteophytes").
  • Figure 2: Examples from the multi-task QA dataset. Each entry includes a question $\mathcal{Q}$ and a ground-truth reference $\mathcal{R}$. This reference contains a Chain-of-Thought (CoT) $\mathcal{T}$ enclosed by <think> tags, followed by the final answer $\mathcal{A}$ within <answer> tags. The green "annotation area" highlights the visual prompt on the image.
  • Figure 3: Overview of our Knowledge Graph Consistency Framework. This two-stage pipeline enhances MedVLM reasoning via Group-Relative Policy Optimization. High-value training examples are selected using Hard Example Mining. The reward signal is determined by the Knowledge Graph Consistency Reward, which uses Anatomy-Disease Entity and Relation Extraction to provide fine-grained process supervision.
  • Figure 4: Type-wise accuracy (%) on chest X-ray VQA across different sampling strategies. The default setting uses 1k samples and a composite reward of answer correctness and Entity-Relation Matching (Jaccard).
  • Figure 5: Comparison of distribution between SFT failure cases (a) and the original dataset (b), with the average token-level probability of the content between <answer> and </answer>.
  • ...and 4 more figures
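
The response format described in the Figure 2 caption and the statistic named in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming the response is a plain string with one <think> block and one <answer> block, and that per-token log-probabilities for the answer span are already available from the model. The helper names are hypothetical, and the paper's actual parsing and hard-example selection rule are not specified in this section.

```python
import math
import re
from typing import List, Optional, Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (chain_of_thought, answer); a part is None if its tags are missing."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None)


def mean_token_probability(token_logprobs: List[float]) -> float:
    """Average per-token probability of a span, given natural-log probabilities."""
    if not token_logprobs:
        return 0.0  # assumption: an empty answer span is treated as zero confidence
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


response = "<think>The left costophrenic angle is blunted.</think><answer>yes</answer>"
cot, answer = split_response(response)
print(cot)
print(answer)

# Hypothetical log-probabilities for the tokens between <answer> and </answer>.
print(mean_token_probability([math.log(0.5), math.log(1.0)]))
```

A low average probability on the answer span is one plausible signal of an uncertain prediction, but any threshold for treating an example as hard would be an additional assumption.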