Scene Depth Estimation from Traditional Oriental Landscape Paintings

Sungho Kang, YeongHyeon Park, Hyunkyu Park, Juneho Yi

TL;DR

This work addresses depth estimation for traditional oriental landscape paintings, a challenging task because these paintings use unconventional perspective cues and suffer from poor preservation. It introduces a two-step Image-to-Image translation pipeline, preceded by CLIP-based image matching, that generates a semantically matched real-scene image whose depth is then estimated with a pre-trained model such as MiDaS. The approach builds a CLIP-matched dataset using a predefined dictionary of landscape objects and a 1-to-$K$ matching scheme. CycleGAN is trained on this dataset and, in the first step, produces a pseudo-real scene image, which DiffuseIT refines in the second step into a realistic real-scene image for depth estimation. Experimental results, including qualitative evaluations, an ablation study, and a user study, show that the method preserves structural fidelity and yields depth maps usable for creating tactile sculptures, marking a first step toward enabling visually impaired audiences to experience oriental paintings.

Abstract

Scene depth estimation from paintings can streamline the process of 3D sculpture creation so that visually impaired people can appreciate the paintings through touch. However, measuring the depth of oriental landscape painting images is extremely challenging because of their unique methods of depicting depth and their poor preservation. To address the problem of scene depth estimation from oriental landscape painting images, we propose a novel framework that consists of a two-step Image-to-Image translation method, with CLIP-based image matching at the front end, to predict the real scene image that best matches the given oriental landscape painting image. We then apply a pre-trained SOTA depth estimation model to the generated real scene image. In the first step, CycleGAN converts an oriental landscape painting image into a pseudo-real scene image. We utilize CLIP to semantically match landscape photo images with an oriental landscape painting image for training CycleGAN in an unsupervised manner. In the second step, the pseudo-real scene image and the oriental landscape painting image are fed into DiffuseIT to predict a final real scene image. Finally, we measure the depth of the generated real scene image using a pre-trained depth estimation model such as MiDaS. Experimental results show that our approach performs well enough to predict real scene images corresponding to oriental landscape painting images. To the best of our knowledge, this is the first study to measure the depth of oriental landscape painting images. Our research could potentially assist visually impaired people in experiencing paintings in diverse ways. We will release our code and resulting dataset.
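
The data flow described in the abstract can be sketched as a simple composition of three stages. The following is a minimal illustration only, not the authors' implementation: the function `estimate_depth_from_painting`, the `Image` type, and the three `toy_*` stand-ins are hypothetical names introduced here, and the stand-ins perform trivial arithmetic in place of the real CycleGAN, DiffuseIT, and MiDaS models. The point it shows is that the second stage receives both the pseudo-real image and the original painting.

```python
from typing import Callable, List

# Toy grayscale image: rows of pixel values in [0, 1] (an assumption for illustration).
Image = List[List[float]]


def estimate_depth_from_painting(
    painting: Image,
    stage1: Callable[[Image], Image],         # stands in for CycleGAN
    stage2: Callable[[Image, Image], Image],  # stands in for DiffuseIT
    depth_model: Callable[[Image], Image],    # stands in for MiDaS
) -> Image:
    """Chain the stages: painting -> pseudo-real -> real scene -> depth map."""
    pseudo_real = stage1(painting)
    # The second step is conditioned on BOTH the pseudo-real image and the painting.
    real_scene = stage2(pseudo_real, painting)
    return depth_model(real_scene)


# Hypothetical stand-ins so the sketch runs without any trained model.
def toy_stage1(img: Image) -> Image:
    return [[min(1.0, p + 0.1) for p in row] for row in img]


def toy_stage2(pseudo: Image, painting: Image) -> Image:
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(pseudo, painting)]


def toy_depth(img: Image) -> Image:
    return [[1.0 - p for p in row] for row in img]


painting = [[0.0, 0.5], [1.0, 0.2]]
depth = estimate_depth_from_painting(painting, toy_stage1, toy_stage2, toy_depth)
```

In a real setting each callable would wrap a trained network. The composition itself stays the same, which makes it straightforward to drop one stage when reproducing an ablation.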

Paper Structure (14 sections, 6 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: An overview of our method. Directly applying a pre-trained SOTA depth estimation model to the given oriental landscape painting image, $I_{ori}$, yields a meaningless depth map ('Depth map of $I_{ori}$'). We propose a novel framework that consists of CLIP-based image matching followed by two-step I2I translation. Through CLIP-based image matching, we first obtain pairs of an oriental landscape painting image and its corresponding landscape photo image. For the two-step I2I translation, we employ CycleGAN [gi2i02] and DiffuseIT [di2i01]. In the first step, CycleGAN [gi2i02], trained on the CLIP-matched dataset, translates an oriental landscape painting image into a pseudo-real scene image. In the second step, the generated pseudo-real scene image and the given oriental landscape painting image are fed into DiffuseIT [di2i01] to create a real scene image. The resulting real scene image makes it possible to measure depth with a pre-trained depth estimation model such as MiDaS [MiDaS]. The measured depth map can be directly applied to craft 3D sculptures.
  • Figure 2: Examples of oriental landscape paintings from museums [mus01, mus02, mus04]. Oriental landscape paintings have been rendered in various styles over their long history [oriental09, oriental10]. A unique technique called the 'three-way method' is employed to create a sense of perspective [oriental02, oriental04, oriental06, oriental07]. However, oriental landscape paintings often portray objects inconsistently because multiple types of the 'three-way method' are used within a single painting. Moreover, oriental landscape paintings often contain empty spaces that are used to depict depth [oriental11, oriental05]. Additionally, oriental landscape paintings are poorly preserved.
  • Figure 3: An image matching method using CLIP [CLIP01]. We employ CLIP [CLIP01] to semantically match oriental landscape painting images with landscape photo images. To build a pre-defined dictionary for CLIP [CLIP01], we carefully selected objects that frequently appear in oriental landscape paintings by referencing papers [oriental08, oriental12] and museum collections [mus01, mus02, mus03, mus04, mus05]. The pre-defined dictionary is fed into the text encoder of CLIP [CLIP01], while oriental landscape painting images are fed into the image encoder to measure similarity. Simultaneously, all landscape photo images in the LHQ dataset [LHQ] are fed into another image encoder to measure similarity. By comparing the similarities, we match each oriental landscape painting image with the top-$K$ landscape photo images most similar to it. These top-$K$ matches are used to create the CLIP-matched dataset for training CycleGAN [gi2i02] so that it generates more plausible pseudo-real scene images.
  • Figure 4: For qualitative evaluation, we compared our method with other I2I translation models. Applied directly, MiDaS [MiDaS], a pre-trained depth estimation model, failed to accurately estimate the depth of oriental landscape painting images despite its strong generalization ability. While GAN-based I2I translation methods [gi2i02, gi2i05, gi2i11] preserved the structural information of these painting images, they struggled to achieve realistic translations. CAST [gi2i08] and VQ-I2I [gi2i09] not only failed to produce realistic translations but also lost structural fidelity. Diffusion-based I2I models such as DiffuseIT [di2i01] and BBDM [di2i04] managed to realistically transform oriental landscape painting images into real scene images, albeit with severe distortion of structural information. In contrast, our method preserves structural information like CycleGAN [gi2i02], CUT [gi2i05], or DRIT++ [gi2i11] while achieving realistic translations akin to BBDM [di2i04] or DiffuseIT [di2i01]. Our approach successfully enables depth measurement with a pre-trained MiDaS [MiDaS] for the given oriental landscape painting images.
  • Figure 5: The w/o CycleGAN [gi2i02] cases show significant structural distortion during translation. When DiffuseIT [di2i01] is omitted, the predicted images maintain the structure of the oriental landscape painting images but fail to look like realistic real scene images. The w/ CLIP-based image matching cases demonstrate that CLIP-based image matching helps CycleGAN [gi2i02] produce high-quality pseudo-real scene images. Our method shows that combining CLIP-based image matching, CycleGAN [gi2i02], and DiffuseIT [di2i01] generates more plausible real scene images. Best viewed in color.
  • ...and 1 more figure
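
The CLIP-based matching described in the Figure 3 caption can be illustrated with a small top-$K$ retrieval sketch. This is a hedged illustration under stated assumptions, not the authors' code. The embeddings below are hand-written toy vectors rather than real CLIP outputs. The caption does not specify how similarities are compared, so this sketch assumes that each image is summarized by its cosine similarities to the dictionary words and that paintings and photos are compared by the cosine similarity of those summaries. The names `cosine`, `dictionary_profile`, and `match_top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dictionary_profile(image_emb: Sequence[float],
                       text_embs: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of one image embedding to every dictionary-word embedding."""
    return [cosine(image_emb, t) for t in text_embs]


def match_top_k(painting_embs: Dict[str, Sequence[float]],
                photo_embs: Dict[str, Sequence[float]],
                text_embs: Sequence[Sequence[float]],
                k: int) -> Dict[str, List[str]]:
    """Match each painting with its K most similar photos (1-to-K matching)."""
    photo_profiles = {name: dictionary_profile(emb, text_embs)
                      for name, emb in photo_embs.items()}
    matches: Dict[str, List[str]] = {}
    for name, emb in painting_embs.items():
        profile = dictionary_profile(emb, text_embs)
        ranked = sorted(photo_profiles,
                        key=lambda photo: cosine(profile, photo_profiles[photo]),
                        reverse=True)
        matches[name] = ranked[:k]
    return matches


# Toy dictionary of three words (mountain, river, tree) as orthogonal vectors.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
painting_embs = {"painting_a": [0.9, 0.1, 0.0]}
photo_embs = {
    "photo_mountain": [1.0, 0.2, 0.0],
    "photo_river": [0.1, 1.0, 0.0],
    "photo_forest": [0.0, 0.1, 1.0],
    "photo_peak": [0.8, 0.0, 0.1],
}
matches = match_top_k(painting_embs, photo_embs, text_embs, k=2)
```

Each painting together with its matched photos would then form one group of the CLIP-matched dataset used to train the first-step translation model.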