Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

TL;DR

This work tackles the problem of image screening by focusing on image aesthetic reasoning in multimodal large language models. It introduces a large-scale image screening dataset (≈128k samples, ≈640k images) with four evaluation dimensions and a two-stage training pipeline that first builds spatial understanding via CoT data and then applies HCM-GRPO reinforcement fine-tuning with a Dynamic Proportional Accuracy reward. The proposed method significantly enhances performance on compact models, achieving a 64.74 overall score on InternVL3-2B-CoT, outperforming several large open-source and leading closed-source models, and showing generalization to real-world public datasets. The work provides a scalable data-and-methodology framework for robust image screening in multimodal systems, with practical implications for e-commerce and AIGC governance. Future directions include richer CoT data and unsupervised knowledge transfer to enable broader domain adaptation.

Abstract

Image generation has improved significantly in recent years. Image screening, however, has received little study, and Multimodal Large Language Models (MLLMs) perform poorly on it because data is scarce and their image aesthetic reasoning ability is weak. In this work, we propose a complete solution that addresses both problems, covering data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples and about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates image aesthetic reasoning ability along four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. For data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotation, to acquire high-quality chain-of-thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, and call the result HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, perform close to random guessing in image aesthetic reasoning. In contrast, by leveraging HCM-GRPO, we surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
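To make the reward-and-advantage idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not define the DPA reward or the HCM criterion, so `proportional_accuracy` is a hypothetical stand-in that scores a response by the fraction of the four generated images it judges correctly, and neither the "dynamic" part of DPA nor hard-case mining is modeled. The only standard piece is `group_relative_advantages`, which standardizes rewards within one group of sampled responses as GRPO does; using the population standard deviation and the `eps` value are choices made here for illustration.

```python
from statistics import mean, pstdev


def proportional_accuracy(predicted, labels):
    """Hypothetical reward: fraction of per-image judgments that match the labels."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    hits = sum(p == t for p, t in zip(predicted, labels))
    return hits / len(labels)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative labels: whether each of the four generated images is acceptable.
labels = [True, False, False, True]

# One group of four sampled responses to the same prompt (made-up judgments).
group = [
    [True, False, False, True],   # all four correct
    [True, True, False, True],    # three correct
    [False, True, True, False],   # none correct
    [True, False, True, False],   # two correct
]

rewards = [proportional_accuracy(resp, labels) for resp in group]
advantages = group_relative_advantages(rewards)
```

A graded reward of this kind gives partially correct responses a distinct advantage instead of lumping them with fully wrong ones, which is one plausible reason to prefer a proportional score over a binary one inside a group-relative scheme.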

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Overview of the image aesthetic dataset and quantitative comparison results. (a) The four evaluation dimensions of image aesthetics covered by the dataset. (b) Extensive quantitative comparisons demonstrate the superiority of our HCM-GRPO method on the image screening task.
  • Figure 2: Overview of dataset construction pipeline.
  • Figure 3: Presentation of different annotation paradigms.
  • Figure 4: Illustration of model training process.
  • Figure 5: Samples from the constructed dataset.
  • ...and 1 more figure