CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning

Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, Gao Huang

TL;DR

CheXWorld addresses the need for domain-robust radiograph representations by introducing a self-supervised world-modeling framework. It jointly learns local tissue detail, global body geometry, and domain variation through three tailored tasks within a unified JEPA-based transformer pipeline. Across eight classification and segmentation benchmarks, it achieves state-of-the-art transfer performance and is data-efficient, with qualitative analyses showing anatomically meaningful predictions and domain-aware generalization. This work highlights a viable route toward general-purpose medical foundation models trained on radiographs that can adapt across domains and tasks.

Abstract

Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. Specifically, our work develops a unified framework that simultaneously models three aspects of medical knowledge essential for qualified radiologists, including 1) local anatomical structures describing the fine-grained characteristics of local tissues (e.g., architectures, shapes, and textures); 2) global anatomical layouts describing the global organization of the human body (e.g., layouts of organs and skeletons); and 3) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs (e.g., varying clarity, contrast, and exposure caused by collecting radiographs from different hospitals, devices, or patients). Empirically, we design tailored qualitative and quantitative analyses, revealing that CheXWorld successfully captures these three dimensions of medical knowledge. Furthermore, transfer learning experiments across eight medical image classification and segmentation benchmarks showcase that CheXWorld significantly outperforms existing SSL methods and large-scale medical foundation models. Code & pre-trained models are available at https://github.com/LeapLabTHU/CheXWorld.
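To make the idea of world modeling in latent space more concrete, the following is a minimal NumPy sketch of a generic JEPA-style prediction step. It is not the CheXWorld implementation: the linear "encoders", the pooled predictor, the conditioning vector, and all sizes are hypothetical stand-ins chosen for illustration. The conditioning vector is assumed to carry whatever describes the transition being modeled (for example, the position of a masked region, or the parameters of an appearance change).

```python
import numpy as np

def encode(patches, weight):
    """Toy linear encoder: maps each patch vector to a latent vector."""
    return np.tanh(patches @ weight)

def predict(context_latents, condition, w_ctx, w_cond):
    """Toy predictor: pooled context latents plus a conditioning vector."""
    pooled = context_latents.mean(axis=0)
    return np.tanh(pooled @ w_ctx + condition @ w_cond)

def latent_loss(pred, target):
    """Mean squared error computed in latent space, not pixel space."""
    return float(np.mean((pred - target) ** 2))

def ema_update(target_w, online_w, momentum=0.99):
    """Exponential moving average, a common way to maintain a target encoder."""
    return momentum * target_w + (1.0 - momentum) * online_w

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only to keep the example small.
num_patches, patch_dim, latent_dim, cond_dim = 16, 32, 8, 4

# A fake "radiograph" already split into flattened patches.
patches = rng.normal(size=(num_patches, patch_dim))

# Hide a block of patches; the rest form the visible context.
target_mask = np.zeros(num_patches, dtype=bool)
target_mask[[5, 6, 9, 10]] = True

w_online = rng.normal(scale=0.1, size=(patch_dim, latent_dim))
w_target = w_online.copy()  # assumption: target encoder starts as a copy
w_ctx = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
w_cond = rng.normal(scale=0.1, size=(cond_dim, latent_dim))

# Assumption: a vector describing the transition to predict
# (e.g. where the hidden region is, or how appearance changed).
condition = rng.normal(size=cond_dim)

context_latents = encode(patches[~target_mask], w_online)
target_latent = encode(patches[target_mask], w_target).mean(axis=0)
pred = predict(context_latents, condition, w_ctx, w_cond)
loss = latent_loss(pred, target_latent)

# After an optimizer step on the online weights (omitted here),
# the target weights would follow them slowly.
w_target = ema_update(w_target, w_online)
print(f"latent prediction loss: {loss:.4f}")
```

The point of the sketch is the shape of the objective: the model predicts the representation of an unseen or transformed view from a visible context and a description of the transition, and the error is measured between latent vectors rather than pixels.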

Paper Structure

This paper contains 21 sections, 13 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of the CheXWorld framework. The upper part of the figure depicts three dimensions of medical knowledge that are formulated in our framework, including (a) local anatomical structures describing the fine-grained characteristics of local tissues, (b) global anatomical layouts describing the global organization of the human body and (c) domain variations that encourage CheXWorld to model the transitions across different appearance domains of radiographs. The middle part of the figure illustrates the world modeling tasks corresponding to these aspects of medical knowledge. (d) shows our unified pipeline that combines the merits of all three tasks.
  • Figure 2: A basic framework of world modeling.
  • Figure 2: Results on segmentation (left) and few-shot learning (right) tasks. The Dice score is reported for the segmentation benchmarks and the AUROC for the few-shot learning benchmarks.
  • Figure 3: Formulation of global anatomical layout modeling.
  • Figure 4: Visualization of the CheXWorld predictor outputs (zoom in for details). The images presented in this figure were not included in the pre-training of CheXWorld or the training of the diffusion model. Regions in red bounding boxes denote the predictor outputs that are mapped to pixel space using the RCDM framework (Bordes et al., 2021). In (a), gray areas indicate masked regions excluded from the context. In (b), the two overlapping regions alternately serve as context and target.
  • ...and 4 more figures