VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery

Jinchao Ge; Tengfei Cheng; Biao Wu; Zeyu Zhang; Shiya Huang; Judith Bishop; Gillian Shepherd; Meng Fang; Ling Chen; Yang Zhao

TL;DR

VaseVL addresses the domain gap in expert-level cultural heritage understanding by pairing VaseVQA, a large-scale, expert-annotated VQA benchmark for ancient Greek pottery, with a two-stage training regime that combines supervised fine-tuning and reinforcement learning guided by verifiable rewards. The reward design jointly optimizes lexical precision and semantic fidelity via $s_{\text{kw}}$ and $s_{\text{sem}}$, balanced by $\tilde{R}(q) = \beta(q) s_{\text{kw}} + (1 - \beta(q)) s_{\text{sem}}$, and amplified for hard categories with $w(q)$ under a GRPO objective. Experimental results show that domain-aligned training (SFT+RL) outperforms larger general-purpose MLLMs, achieving superior reasoning on seven expert-defined task types, with notable gains on high-level questions like Date, Attribution, and Decoration. VaseVQA and VaseVL together offer a reproducible benchmark and methodology for advancing domain-specific visual reasoning in cultural heritage, with implications for archaeology, museums, and education.
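The reward blend above can be illustrated with a minimal sketch. It assumes the keyword score $s_{\text{kw}}$ and semantic score $s_{\text{sem}}$ are already computed and lie in $[0, 1]$ (how they are computed is not specified here), that $w(q)$ scales the blended reward multiplicatively, and that advantages are normalized within a group of sampled answers in the usual GRPO style. The function names and the `eps` parameter are hypothetical, chosen for illustration only.

```python
from statistics import mean, pstdev


def blended_reward(s_kw: float, s_sem: float, beta: float) -> float:
    """R~(q) = beta(q) * s_kw + (1 - beta(q)) * s_sem."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * s_kw + (1.0 - beta) * s_sem


def weighted_reward(s_kw: float, s_sem: float, beta: float, weight: float = 1.0) -> float:
    """Scale the blended reward by a per-question weight w(q).

    Assumption: the weight is applied multiplicatively; the summary only
    says that w(q) amplifies hard categories.
    """
    return weight * blended_reward(s_kw, s_sem, beta)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of sampled answers (GRPO-style).

    Each advantage is (r - group mean) / (group std + eps), so answers are
    scored relative to the other samples for the same question.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical sampled answers to one question: (s_kw, s_sem) pairs.
    samples = [(1.0, 0.9), (0.5, 0.7), (0.0, 0.2)]
    rewards = [weighted_reward(kw, sem, beta=0.5, weight=1.5) for kw, sem in samples]
    print(group_relative_advantages(rewards))
```

With a larger `beta`, exact keyword matches dominate the reward, while a smaller `beta` favors answers that are semantically close but worded differently.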

Abstract

Understanding cultural heritage artifacts such as ancient Greek pottery requires expert-level reasoning that remains challenging for current MLLMs due to limited domain-specific data. We introduce VaseVQA, a benchmark of 31,773 images and 67,614 question-answer pairs across seven expert-defined categories, enabling systematic evaluation of expert-level cultural heritage understanding. Using this dataset, we explore effective training strategies for domain-specific reasoning. While supervised fine-tuning improves adaptation to domain knowledge, it struggles with deeper reasoning tasks. We propose VaseVL, which augments SFT with reinforcement learning using verifiable rewards. Experiments show that VaseVL consistently outperforms supervised baselines, especially on reasoning-intensive questions, highlighting the value of targeted reinforcement learning for cultural heritage visual question answering. Our code and dataset will be released at https://github.com/AIGeeksGroup/VaseVQA.
