SketchRef: a Multi-Task Evaluation Benchmark for Sketch Synthesis

Xingyue Lin, Xingjian Hu, Shuai Peng, Jianhua Zhu, Liangcai Gao

TL;DR

SketchRef addresses the lack of standardized evaluation for sketch synthesis by introducing a unified, multi-task benchmark that exploits the structure shared between sketches and their reference photos. It defines two tasks, category prediction and structural consistency estimation, across four domains, and introduces mean recognizability under simplification ($mRS$) to balance recognizability against simplicity. A pose-alignment metric $R_s$, computed from keypoint correspondences via Object Keypoint Similarity (OKS), and a relative simplicity measure, SR, enable fair comparison across methods. Evaluations of eight sketch-synthesis methods show that strong category recognizability does not imply structural fidelity, which highlights the need for structure-preserving training. The benchmark is validated with $7{,}920$ human responses and provides a practical framework for advancing sketch synthesis research.

Abstract

Sketching is a powerful artistic technique for capturing essential visual information about real-world objects and has increasingly attracted attention in image synthesis research. However, the field lacks a unified benchmark to evaluate the performance of various synthesis methods. To address this, we propose SketchRef, the first comprehensive multi-task evaluation benchmark for sketch synthesis. SketchRef fully leverages the shared characteristics between sketches and reference photos. It introduces two primary tasks: category prediction and structural consistency estimation, the latter being largely overlooked in previous studies. These tasks are further divided into five sub-tasks across four domains: animals, common things, human body, and faces. Recognizing the inherent trade-off between recognizability and simplicity in sketches, we are the first to quantify this balance by introducing a recognizability calculation method constrained by simplicity, mRS, ensuring fair and meaningful evaluations. To validate our approach, we collected 7,920 responses from art enthusiasts, confirming the effectiveness of our proposed evaluation metrics. Additionally, we evaluate the performance of existing sketch synthesis methods on our benchmark, highlighting their strengths and weaknesses. We hope this study establishes a standardized benchmark and offers valuable insights for advancing sketch synthesis algorithms.

Paper Structure (14 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of our proposed dataset. The left image shows the data and annotations we cover, together with sketches synthesized from our data. Some of the synthesized sketches miss important structures: in the face sketch synthesized by PhotoSketch (Li et al., 2019), the eyebrows and mouth are missing, and in the human sketch synthesized by CLIPasso (Vinker et al., 2022), the right leg is absent. We use keypoints as a bridge to quantify these structural errors. The table on the right compares the evaluation datasets used in our benchmark with those of previous methods; our dataset covers a wider range of domains and includes a significantly larger volume of data.
  • Figure 2: The trade-off between recognizability and simplicity in sketches. (a) Synthesized sketches with different numbers of strokes. (b) Recognizability, as measured by the proposed metrics, for sketches with varying stroke counts.
  • Figure 3: (a) Example of essential regions: we argue that lines near keypoints influence how structure is expressed, and that erasing them can impair the recognition of limb positions. (b) In sketches synthesized by CLIPasso, we erase a certain number of essential regions and compute the scores of various similarity metrics on the erased sketches. These scores are normalized by subtracting the scores of the sketches without erasure.
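
The structural consistency metric summarized above relies on keypoint correspondences scored with OKS. As a rough illustration only, the sketch below computes OKS in the COCO style between keypoints detected on a sketch and the reference keypoints of the photo. It is not the paper's implementation of $R_s$: the function name `oks`, the argument layout, and the `kappas` values in the usage lines are assumptions made for this example.

```python
import numpy as np

def oks(pred, gt, visible, area, kappas):
    """COCO-style Object Keypoint Similarity (illustrative sketch).

    pred, gt : (K, 2) arrays of keypoint (x, y) coordinates
    visible  : (K,) booleans, True where the reference keypoint is labeled
    area     : object area in squared pixels (s^2 in the COCO formula)
    kappas   : (K,) per-keypoint falloff constants (k_i = 2 * sigma_i in COCO)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    kappas = np.asarray(kappas, dtype=float)

    # No labeled keypoints: similarity is undefined, return 0 by convention here.
    if not visible.any():
        return 0.0

    # Squared Euclidean distance between each predicted and reference keypoint.
    d2 = np.sum((pred - gt) ** 2, axis=1)
    # Per-keypoint similarity: exp(-d_i^2 / (2 * s^2 * k_i^2)).
    sim = np.exp(-d2 / (2.0 * area * kappas ** 2))
    # Average over labeled keypoints only.
    return float(sim[visible].mean())

# Hypothetical two-keypoint example; the constants are placeholders.
gt = [[0.0, 0.0], [0.0, 0.0]]
pred = [[0.0, 0.0], [3.0, 4.0]]
score = oks(pred, gt, visible=[True, True], area=25.0, kappas=[1.0, 1.0])
```

A score of 1 means every labeled keypoint in the sketch coincides with its reference position; the score decays toward 0 as keypoints drift, with the tolerance scaled by object size and the per-keypoint constant.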