Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs

Zheng Sun, Yi Wei, Long Yu

TL;DR

The paper tackles the challenge of image aesthetic reasoning in medical image screening with Multimodal Large Language Models, addressing data scarcity and poor reasoning capabilities. It introduces a 1500+ sample medical image screening dataset covering four aesthetic dimensions and a two-stage training pipeline combining CoT-based supervised fine-tuning with DPA-GRPO reinforcement learning to enhance reasoning. The results show that state-of-the-art models perform near random on this task, while a smaller model trained with DPA-GRPO achieves 55.98 overall, outperforming large open-source and closed-source baselines and highlighting the efficacy of the proposed reward design. This work provides a practical configuration for improving image aesthetic reasoning in MLLMs and lays groundwork for safer and more reliable AI-assisted medical image screening.

Abstract

Multimodal Large Language Models (MLLMs) are widely applied across many domains, such as multimodal understanding and generation. With the development of diffusion models (DMs) and unified MLLMs, image generation performance has improved significantly. However, image screening is rarely studied, and MLLM performance on it is unsatisfactory because of the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples; each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, and (4) Extension Rationality. For methodology, we use long chains of thought (CoT) and Group Relative Policy Optimization with a Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, perform akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we surpass the scores of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration for image aesthetic reasoning in the future.
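
The abstract names Group Relative Policy Optimization but does not define the Dynamic Proportional Accuracy reward, so the following is only a minimal sketch of the group-relative normalization step that GRPO is generally described with: each sampled response's reward is compared against the mean and standard deviation of its own group. The function name, the epsilon term, the use of the population standard deviation, and the reward values are all illustrative assumptions, not the paper's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` is an assumed stabilizer for groups whose rewards are all equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one screening question.
# These placeholder values are NOT produced by the paper's DPA reward.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal under this scheme.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

Responses that score above their group's mean receive a positive advantage and those below it a negative one, which is why the shape of the reward (for example, whether partial credit is given) matters for this kind of training.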

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the Image Aesthetic Dataset and Comparison Results. (a) We summarize four dimensions of image aesthetic judgment from the dataset. (b) Extensive quantitative comparison results demonstrate the superiority of our DPA-GRPO in medical image screening tasks.
  • Figure 2: Overview of data construction pipeline and proportion of different evaluation dimensions in the dataset.
  • Figure 3: Samples from the dataset.
  • Figure 4: The presentation of the reasoning process.