HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao, Jianbing Shen

TL;DR

HanMoVLM bridges the gap between generic VLMs and professional art evaluation: it achieves high consistency with professional experts, significantly improves the quality of Chinese painting generation, and can serve as a critical backbone for Test-time Scaling in image generation.

Abstract

While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind: unlike human experts, they cannot offer professional evaluation of artworks within specific artistic domains. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese artistic domain, which is more abstract and demands extensive artistic training to evaluate. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To enable rigorous judgment, we propose HanMoVLM and construct an expert-validated Chain-of-Thought (CoT). This CoT guides the model through expert-level reasoning, from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and the typical three-tier evaluation of Chinese paintings. Furthermore, we design a reward function that refines HanMoVLM's reasoning process and improves its accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the artistically strongest outputs from multiple candidates. Experimental results and human studies confirm that HanMoVLM effectively bridges the gap, achieving high consistency with professional experts and significantly improving the quality of Chinese painting generation.
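The verifier role described in the abstract can be illustrated with a minimal best-of-N selection sketch. This is not the authors' implementation: the names `best_of_n`, `generate`, and `score` are hypothetical, and the toy generator and scorer below merely stand in for a text-to-image model and a frozen evaluator such as HanMoVLM.

```python
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], Image],   # hypothetical T2I sampler: (prompt, seed) -> image
    score: Callable[[str, Image], float],    # hypothetical verifier: (prompt, image) -> score
    n: int = 4,
) -> Tuple[Image, float, List[float]]:
    """Sample n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw one candidate per seed from the generator.
    candidates = [generate(prompt, seed) for seed in range(n)]
    # Score each candidate with the (frozen) verifier.
    scores = [score(prompt, image) for image in candidates]
    # Pick the index of the highest score; ties resolve to the earliest candidate.
    best_idx = max(range(n), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx], scores


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str, seed: int) -> str:
    return f"{prompt}#{seed}"


def toy_score(prompt: str, image: str) -> float:
    seed = int(image.rsplit("#", 1)[1])
    return {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}[seed]


if __name__ == "__main__":
    best, best_score, all_scores = best_of_n("ink landscape", toy_generate, toy_score, n=4)
    print(best, best_score, all_scores)
```

In this setup the generator is left untouched and extra compute goes only into sampling and scoring, so the quality of the selected output depends on how well the verifier's scores track expert judgment.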

Paper Structure (30 sections, 10 equations, 13 figures, 4 tables)


Figures (13)

  • Figure 1: The motivation of our study. (a) Generic VLMs suffer from artistic misalignment, evaluation gaps, and low-quality data when handling domain-specific artistic content such as Chinese paintings; (b) Expert-level VLMs introduce a structured expert chain-of-thought to produce reliable final scores; (c) Experimental results show that our CoT and RL significantly improve performance.
  • Figure 2: The overall framework of HanMoVLM. For visual artistic understanding, we fine-tune HanMoVLM on expert-level CoT and expert answers via SFT, and then perform RFT based on the expert reward function. For visual artistic generation, we apply the frozen HanMoVLM as an external verifier to evaluate the images sampled from existing T2I models.
  • Figure 3: The expert-level Chain-of-Thought in our HanMoVLM. When a generic VLM evaluates Chinese paintings, it tends to generate outputs based on general (non-expert) knowledge rather than following professional guidelines. As a result, its analysis does not reflect the experts' procedures, and it performs worse in terms of professional level, reliability, hallucination, and human logic.
  • Figure 4: Composition of HanMo-Bench.
  • Figure 5: The construction pipeline of HanMo-Bench.
  • ...and 8 more figures