ArtMentor: AI-Assisted Evaluation of Artworks to Explore Multimodal Large Language Models Capabilities
Chanjin Zheng, Zengyi Yu, Yilin Jiang, Mingzi Zhang, Xunuo Lu, Jing Jin, Liteng Gao
TL;DR
The paper tackles the challenge of evaluating multimodal large language models (MLLMs) in art education by introducing ArtMentor, a process-oriented HCI space that records teacher-MLLM interactions across nine art-evaluation dimensions. It combines a multi-agent architecture (E-Agent, R-Agent, S-Agent) with an HCI dataset of 380 sessions to quantify MLLM capabilities through integrated ML, NLP, and HCI metrics, while allowing the agents to be upgraded iteratively. Key contributions include (1) the ArtMentor space and its freely accessible dataset, (2) a holistic evaluation framework combining entity recognition, style assessment, scoring, and text generation, and (3) empirical findings on GPT-4o's strengths and areas for improvement in perception, understanding, and reasoning within art evaluation. The work advances a process-oriented approach to evaluating AI in education and art, with practical implications for deploying AI copilots in classroom settings and guiding future model refinements.
Abstract
Can Multimodal Large Language Models (MLLMs), with capabilities in perception, recognition, understanding, and reasoning, function as independent assistants in art evaluation dialogues? Current MLLM evaluation methods, which rely on subjective human scoring or costly interviews, lack comprehensive coverage of various scenarios. This paper proposes a process-oriented Human-Computer Interaction (HCI) space design to facilitate more accurate MLLM assessment and development. This approach aids teachers in efficient art evaluation while also recording interactions for MLLM capability assessment. We introduce ArtMentor, a comprehensive space that integrates a dataset and three systems to optimize MLLM evaluation. The dataset consists of 380 sessions conducted by five art teachers across nine critical dimensions. The modular system includes agents for entity recognition, review generation, and suggestion generation, enabling iterative upgrades. Machine learning and natural language processing techniques ensure the reliability of evaluations. The results confirm GPT-4o's effectiveness in assisting teachers in art evaluation dialogues. Our contributions are available at https://artmentor.github.io/.
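To make the process-oriented idea concrete, the following is a minimal sketch of how a modular agent pipeline could log teacher interactions for later analysis. It is not the ArtMentor implementation: the names `Session`, `run_session`, `AGENTS`, `teacher_edit`, and `acceptance_rate`, the stub agents standing in for MLLM calls, and the acceptance-rate measure are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical agent signature: (artwork id, context of earlier final outputs) -> draft text.
Agent = Callable[[str, Dict[str, str]], str]


@dataclass
class Session:
    """One evaluation session: the artwork plus a log of every agent step."""
    artwork: str
    log: List[Dict[str, str]] = field(default_factory=list)


def run_session(artwork: str,
                agents: Dict[str, Agent],
                teacher_edit: Callable[[str, str], str]) -> Session:
    """Run the agents in order, let the teacher revise each draft, and log both versions."""
    session = Session(artwork)
    context: Dict[str, str] = {}
    for name, agent in agents.items():
        draft = agent(artwork, context)
        final = teacher_edit(name, draft)  # teacher accepts or corrects the draft
        session.log.append({"agent": name, "draft": draft, "final": final})
        context[name] = final  # later agents build on the teacher-approved text
    return session


def acceptance_rate(session: Session) -> float:
    """Fraction of agent drafts the teacher left unchanged (an illustrative metric)."""
    accepted = sum(1 for step in session.log if step["draft"] == step["final"])
    return accepted / len(session.log)


# Stub agents standing in for MLLM calls (assumed behavior, for illustration only).
def entity_agent(artwork: str, context: Dict[str, str]) -> str:
    return "sun, house, tree"


def review_agent(artwork: str, context: Dict[str, str]) -> str:
    return f"Review based on entities: {context['entity']}"


def suggestion_agent(artwork: str, context: Dict[str, str]) -> str:
    return f"Suggestion following review: {context['review']}"


# Swapping one entry here upgrades a single agent without touching the others.
AGENTS: Dict[str, Agent] = {
    "entity": entity_agent,
    "review": review_agent,
    "suggestion": suggestion_agent,
}


def teacher_edit(agent_name: str, draft: str) -> str:
    """Simulated teacher: adds a missed entity, accepts everything else."""
    if agent_name == "entity":
        return draft + ", bird"
    return draft
```

Under these assumptions, `run_session("child_drawing_001", AGENTS, teacher_edit)` yields a three-step log in which each draft is stored next to the teacher's final version. That pairing is the kind of process data the abstract describes, and it would allow agreement between model output and teacher judgment to be measured after the fact.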
