VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang

TL;DR

The paper tackles the scarcity of 3D cultural heritage data by constructing VaseVQA-3D, the first 3D visual question answering benchmark for ancient Greek pottery, comprising 664 high-quality GLB models and 4,460 QA pairs. It introduces VaseVLM, a domain-adapted VLM trained in two stages: LoRA-based supervised fine-tuning on 360-degree vase views and OCR-inspired archaeological captions, followed by Reinforcement Learning with Verifiable Rewards (RLVR) to optimize multi-dimensional archaeological reasoning. The reward framework is defined by $R = \sum_{i=1}^{6} w_i r_i - P + B$, with dimension-specific similarities. A further contribution is VaseEval, a 24-model set for assessing 3D generation quality, which is used to compare 3D reconstruction methods and to select TripoSG for large-scale data synthesis. Empirical results show VaseVLM-7B-RL improving on baselines in both R@1 and lexical similarity (the abstract reports gains of 12.8% and 6.6%, respectively, over the previous state of the art), underscoring the value of domain-specific training and RL in cultural heritage VQA. The work supports digital heritage preservation by providing a practical, scalable pathway for 3D artifact understanding and cross-disciplinary AI collaboration, backed by a publicly available codebase.
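The reward formula in the TL;DR can be illustrated with a small sketch. It assumes that each $r_i$ is a per-dimension similarity score, $w_i$ its weight, $P$ a penalty term, and $B$ a bonus term; the summary does not define these symbols, so this reading, the function name `composite_reward`, and all numeric values below are hypothetical and not taken from the paper.

```python
from typing import Sequence


def composite_reward(
    similarities: Sequence[float],
    weights: Sequence[float],
    penalty: float = 0.0,
    bonus: float = 0.0,
) -> float:
    """Compute R = sum_i w_i * r_i - P + B.

    Assumed reading (not confirmed by the source): `similarities` holds the
    dimension-specific scores r_i, `weights` the w_i, `penalty` is P and
    `bonus` is B.
    """
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    weighted_sum = sum(w * r for w, r in zip(weights, similarities))
    return weighted_sum - penalty + bonus


# Hypothetical weights and scores for six dimensions (illustrative only).
weights = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]
scores = [1.0, 0.5, 0.5, 1.0, 0.0, 1.0]
reward = composite_reward(scores, weights, penalty=0.125, bonus=0.0)
print(reward)
```

Keeping the penalty and bonus as separate arguments mirrors the structure of the formula, so each term can be inspected on its own when a reward looks unexpected.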

Abstract

Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding, demonstrating strong capabilities in general tasks such as image captioning and visual reasoning. However, in specialized cultural heritage domains such as 3D vase artifacts, existing models face severe data scarcity and lack sufficient domain knowledge. Without targeted training data, current VLMs struggle to handle such culturally significant specialized tasks. To address these challenges, we propose VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis. It collects 664 3D models of ancient Greek vases with corresponding question-answer data and establishes a complete data construction pipeline. We further develop the VaseVLM model, which improves vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach: on the VaseVQA-3D dataset, we improve R@1 by 12.8% and lexical similarity by 6.6% over the previous state of the art. These gains significantly improve the recognition and understanding of 3D vase artifacts and provide new technical pathways for digital heritage preservation research. Code: https://github.com/AIGeeksGroup/VaseVQA-3D. Website: https://aigeeksgroup.github.io/VaseVQA-3D.
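For readers unfamiliar with the R@1 metric cited above, the following is a minimal sketch of one common reading, retrieval recall at k. It assumes a square similarity matrix in which row i scores query i against every candidate and the correct candidate for query i sits at index i. The summary does not describe the paper's exact evaluation protocol, so the function name `recall_at_k` and the toy matrix are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose correct candidate ranks in the top k.

    Assumption: the ground-truth match for query i is candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity scores (illustrative only, not from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(r1, r2)
```

In this toy example the second query ranks a wrong candidate first, so it counts as a miss at k=1 but as a hit at k=2.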

Paper Structure

This paper contains 30 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Captions in VaseVQA-3D dataset. Each GLB-format 3D vase is rendered in four canonical views (front, back, top, and bottom) and is accompanied by a concise caption that records decorative motifs, manufacturing technique, provenance, and current repository.
  • Figure 2: QA in VaseVQA-3D dataset. Each data entry contains high-quality 3D vase models, structured question-answer pairs, and GPT-4 enhanced descriptive captions, providing comprehensive support for multimodal understanding of ancient Greek pottery.
  • Figure 3: Complete Data Quality Filtering Pipeline. The figure shows our comprehensive filtering methodology, including ResNet-50-based quality assessment for removing low-quality images, followed by dual CLIP-based semantic filtering for fragment removal and optimal image selection.
  • Figure 4: 3D Generation Methods Comparison. Comparison of TripoSG and Hunyuan3D generation effects based on the VaseEval validation set. TripoSG performs better in mesh quality, and although Hunyuan3D has advantages in texture mapping effects, TripoSG-generated models are closer to ground truth, thus selected for large-scale dataset construction.
  • Figure 5: Complete Pipeline for Vase Dataset Construction. The pipeline progresses from initial data collection (30K+ images) through quality filtering (664 images), 3D generation (664 models), QA construction (9K pairs), to final model training. Each component includes specific quality control mechanisms and validation procedures.
  • ...and 2 more figures