Enhancing Human Pose Estimation in Ancient Vase Paintings via Perceptually-grounded Style Transfer Learning

Prathmesh Madhu, Angel Villar-Corrales, Ronak Kosti, Torsten Bendschus, Corinna Reinhardt, Peter Bell, Andreas Maier, Vincent Christlein

TL;DR

This work tackles the poor cross-domain generalisation of human pose estimation to ancient Greek vase paintings. It introduces a two-stage approach: first, a perceptually grounded AdaIN-based style transfer to create Styled-COCO-Persons (SCP) data that mimics vase painting style, and second, fine-tuning on a small ClassArch (CA) dataset with pose annotations. The method yields substantial improvements in pose estimation on unlabelled data (over 6% increases in mAP and mAR) and further gains when fine-tuned on CA, supported by ablations showing learning of generic domain styles and effective use of perceptual loss. Additionally, the authors demonstrate pose-based image retrieval in art collections, highlighting the practical impact for cultural heritage analytics with minimal labeling.

Abstract

Human pose estimation (HPE) is central to understanding the visual narration and body movements of characters depicted in artwork collections, such as Greek vase paintings. Unfortunately, existing HPE methods do not generalise well across domains, resulting in poorly recognized poses. Therefore, we propose a two-step approach. (1) We adapt a dataset of natural images with known person and pose annotations to the style of Greek vase paintings by means of image style transfer, introducing a perceptually-grounded style-transfer training to enforce perceptual consistency. We then fine-tune the base model on this newly created dataset. We show that style-transfer learning significantly improves the SOTA performance on unlabelled data by more than 6% mean average precision (mAP) as well as mean average recall (mAR). (2) To improve these already strong results further, we created a small dataset (ClassArch) consisting of ancient Greek vase paintings from the 6th-5th century BCE with person and pose annotations. We show that fine-tuning a style-transferred model on this data improves performance further. In a thorough ablation study, we give a targeted analysis of the influence of style intensities, revealing that the model learns generic domain styles. Additionally, we provide a pose-based image retrieval to demonstrate the effectiveness of our method.

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Attic red-figure, Zeus and Ganymede, in ancient Greek vase paintings: (a) original image, (b) pose estimation by OpenPose (Cao et al., 2017), (c) our method.
  • Figure 2: (Column 1) Divine pursuit scene in ancient Greek vase paintings. The central character, a winged persecutor, is depicted in similar poses, i.e., arms extended towards the right (observer viewpoint) and legs in large strides. (Column 2) Leading-the-bride scene, with the central character, the bride, depicted in similar poses with her left hand extended forward (observer viewpoint) and held by the groom. (Column 3) Abduction scene, where the character on the left is abducting the character on the right. (Columns 4 and 5) Wrestling in agonal and mythological contexts between two main characters.
  • Figure 3: Style transfer using AdaIN (Huang and Belongie, 2017) with full style intensity ($\alpha=1$). AdaIN adjusts the first- and second-order moments of the 'Content Image' to match those of the 'Style Image'. A 'Styled Image' (style-transferred) is generated with the semantic content of the 'Content Image' and the style of the 'Style Image'.
  • Figure 4: Dataset samples: (a) images and (b) labels of the CP dataset; (c) and (d) are samples from the SCP dataset with $\alpha=0.5$ and $\alpha=U$, respectively; (e) shows images with (f) the corresponding labels of our CA dataset. Each labelled example shows the corresponding person bounding boxes and their pose keypoints.
  • Figure 5: (First row, top-down pose estimation) (A*) A styled person detector detects all instances, (B*) for which the body joint locations are predicted using a person keypoint detector; (C*) the pose skeletons are assembled by connecting the detected keypoints for each person. Two-step training approach: Step 1 (second row, Styled Models): the person detector is trained on SCP person data and HRNet on SCP pose data. Step 2 (third row, Styled-Tuned Models): the styled person detector from the second row is fine-tuned on CA person data, and the styled HRNet is fine-tuned on CA pose data.
  • ...and 2 more figures
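
The moment matching described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It assumes the standard AdaIN formulation: per-channel mean and standard deviation of the content features are replaced by those of the style features, and the style intensity $\alpha$ blends the result with the original content features. The function name `adain`, the `(C, H, W)` array layout, and the `eps` constant are illustrative assumptions, not taken from the paper; in the full method this operation would act on encoder feature maps and be followed by a decoder, both of which are omitted here.

```python
import numpy as np


def adain(content_feat, style_feat, alpha=1.0, eps=1e-5):
    """Hypothetical helper: AdaIN moment matching on (C, H, W) feature maps.

    alpha=1 applies the full style statistics; alpha=0 returns the
    content features unchanged.
    """
    # Per-channel first- and second-order moments over the spatial axes.
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps  # avoid /0
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)

    # Normalise the content features, then rescale and shift them
    # with the style statistics.
    stylized = s_std * (content_feat - c_mean) / c_std + s_mean

    # Style intensity: linear blend between content and stylized features.
    return alpha * stylized + (1.0 - alpha) * content_feat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(0.0, 1.0, size=(3, 8, 8))  # stand-in content features
    style = rng.normal(2.0, 3.0, size=(3, 8, 8))    # stand-in style features
    half_styled = adain(content, style, alpha=0.5)
    print(half_styled.shape)
```

With `alpha=0.5` the output sits halfway between the content features and their fully re-normalised version, which is one plausible reading of the intermediate style intensities shown for the SCP dataset.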