Scene Depth Estimation from Traditional Oriental Landscape Paintings
Sungho Kang, YeongHyeon Park, Hyunkyu Park, Juneho Yi
TL;DR
This work addresses depth estimation for traditional oriental landscape paintings, a challenging task because the paintings depict depth with unconventional perspective cues and are often poorly preserved. It introduces a two-step Image-to-Image translation pipeline, augmented by CLIP-based image matching, that generates a semantically matched real-scene image; a pre-trained depth estimation model such as MiDaS then estimates depth from that image. The approach builds a CLIP-matched dataset using a predefined dictionary of landscape objects and a 1-to-$K$ matching scheme. In the first step, CycleGAN is trained on this dataset to produce a pseudo-real scene image; in the second step, DiffuseIT refines it into a realistic real-scene image for depth estimation. Experimental results, including qualitative evaluations, an ablation study, and a user study, show that the method preserves structural fidelity and yields depth maps usable for creating tactile sculptures, marking a first step toward enabling visually impaired audiences to experience oriental paintings.
Abstract
Scene depth estimation from paintings can streamline the creation of 3D sculptures that allow visually impaired people to appreciate paintings through touch. However, estimating the depth of oriental landscape painting images is extremely challenging because of their unique method of depicting depth and their poor state of preservation. To address this problem, we propose a novel framework that consists of a two-step Image-to-Image translation method, with CLIP-based image matching at the front end, to predict the real scene image that best matches a given oriental landscape painting image. We then apply a pre-trained state-of-the-art depth estimation model to the generated real scene image. In the first step, CycleGAN converts an oriental landscape painting image into a pseudo-real scene image. To train CycleGAN in an unsupervised manner, we use CLIP to semantically match landscape photo images with each oriental landscape painting image. In the second step, the pseudo-real scene image and the oriental landscape painting image are fed into DiffuseIT to predict the final real scene image. Finally, we estimate the depth of the generated real scene image using a pre-trained depth estimation model such as MiDaS. Experimental results show that our approach predicts real scene images that correspond well to the input oriental landscape painting images. To the best of our knowledge, this is the first study to estimate the depth of oriental landscape painting images. Our research can potentially help visually impaired people experience paintings in diverse ways. We will release our code and the resulting dataset.
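The CLIP-based matching step can be illustrated with a minimal sketch of 1-to-$K$ matching: each painting is paired with the $K$ landscape photos whose embeddings are most similar to its own. The sketch below is an illustration under stated assumptions, not the authors' implementation. It ranks by plain cosine similarity, the embedding vectors are toy values standing in for CLIP features, and the names `cosine_similarity`, `match_top_k`, and all painting and photo identifiers are hypothetical. The predefined dictionary of landscape objects used in the actual method is not modeled here.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_top_k(
    painting_embeddings: Dict[str, Sequence[float]],
    photo_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> Dict[str, List[str]]:
    """For each painting, return the k photo names with the highest similarity."""
    matches: Dict[str, List[str]] = {}
    for painting_name, painting_vec in painting_embeddings.items():
        ranked = sorted(
            photo_embeddings,
            key=lambda name: cosine_similarity(painting_vec, photo_embeddings[name]),
            reverse=True,
        )
        matches[painting_name] = ranked[:k]
    return matches


# Toy 3-dimensional embeddings (assumption: real CLIP features would be
# much higher-dimensional and produced by a CLIP image encoder).
paintings = {
    "painting_a": [1.0, 0.0, 0.2],
    "painting_b": [0.0, 1.0, 0.1],
}
photos = {
    "photo_mountain": [0.9, 0.1, 0.1],
    "photo_river": [0.1, 0.9, 0.0],
    "photo_forest": [0.6, 0.5, 0.3],
    "photo_city": [-0.5, -0.5, 1.0],
}

matches = match_top_k(paintings, photos, k=2)
for painting_name, photo_names in matches.items():
    print(painting_name, "->", photo_names)
```

In a setup like this, the matched photo sets would serve as the real-scene domain when training CycleGAN on unpaired data, so that the translation is learned from photos that are semantically close to the paintings.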
