ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding

Oishi Banerjee, Sung Eun Kim, Alexandra N. Willauer, Julius M. Kernbach, Abeer Rihan Alomaish, Reema Abdulwahab S. Alghamdi, Hassan Rayhan Alomaish, Mohammed Baharoon, Xiaoman Zhang, Julian Nicolas Acosta, Christine Zhou, Pranav Rajpurkar

Abstract

Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.

Paper Structure (10 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Medical photographs pose several unique challenges at the intersection of medical reasoning and natural image interpretation. Left: For example, medical models face a substantial domain shift from specialized imaging modalities [ho2026scoliosisradiopaedia] to photographs. In addition, photographs taken outside the clinic are more likely to have technical flaws that obscure fine-grained medical details (e.g., the abnormal lower-lid eyelashes at the bottom of "Technical Variation"). Right: ReXInTheWild tests whether models can assess the medical content in diverse natural images, including both healthy subjects and patients with medical conditions.
  • Figure 2: The ReXInTheWild benchmark construction pipeline. Stage A: "Image Selection" filters PubMed Central images by caption and visual content to identify suitable medical photographs. Stage B: "Question Generation" uses GPT-5 to produce candidate questions and applies automated editing to remove leading cues and refine distractors. Stage C: "Automatic and Expert Verification" automatically scores clarity and difficulty and routes high-scoring questions for a two-stage expert review process.
  • Figure 3: Large general-purpose models, especially Gemini-3, outperformed MedGemma, a smaller medical MLLM. Performance varied across medical categories, though model rankings generally remained consistent within each category.
  • Figure 4: Failure modes range from low-level errors in basic image interpretation to high-level failures in causal reasoning. Notably, models made errors when describing normal physiology (top), not only when assessing medical abnormalities.
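
Figure 3 reports accuracy both overall and per medical category. As a minimal sketch of how such multiple-choice scores could be computed, the snippet below assumes each evaluated question is a record with `category`, `answer`, and `prediction` fields. These field names, the placeholder topic names, and the toy records are illustrative assumptions; they are not the released dataset's schema or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall, per_category) accuracy for multiple-choice records.

    Each record is a dict with hypothetical keys:
    'category' (topic label), 'answer' (gold option), 'prediction' (model option).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["category"]] += 1
    per_category = {c: correct[c] / totals[c] for c in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Toy records for illustration only; not taken from the benchmark.
records = [
    {"category": "topic_a", "answer": "A", "prediction": "A"},
    {"category": "topic_a", "answer": "C", "prediction": "B"},
    {"category": "topic_b", "answer": "D", "prediction": "D"},
    {"category": "topic_b", "answer": "B", "prediction": "B"},
]
overall, per_category = accuracy_by_category(records)
```

Reporting the per-category breakdown alongside the overall number makes it possible to check whether model rankings stay consistent across topics, as the Figure 3 caption describes.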