ChildlikeSHAPES: Semantic Hierarchical Region Parsing for Animating Figure Drawings

Astitva Srivastava, Harrison Jesse Smith, Thu Nguyen-Phuoc, Yuting Ye

TL;DR

Childlike drawings present a semantic segmentation challenge due to their abstract, region-based representations. We propose CharSegNet, a hierarchical segmentation model built on a fine-tuned Segment Anything Model, and introduce the ChildlikeSHAPES dataset, which annotates 25 semantic parts across more than 16k drawings. The work enables downstream animation tasks including facial expression generation, audio-driven lip-sync, figure shading, and improved body animation, and the model generalizes well to out-of-domain drawings. Together, CharSegNet and ChildlikeSHAPES offer a practical, style-preserving foundation for accessible hand-drawn character animation and advance understanding of semantic representations in abstract art.

Abstract

Childlike human figure drawings are one of humanity's most accessible forms of character expression, yet automatically analyzing their contents remains a significant challenge. While semantic segmentation of realistic humans has advanced considerably in recent years, existing models often fail when confronted with the abstract, representational nature of childlike drawings. Semantic understanding of these drawings is a crucial prerequisite for animation tools that seek to modify figures while preserving their unique style. To this end, we propose a novel hierarchical segmentation model, built upon the architecture and pre-trained weights of the Segment Anything Model (SAM), that quickly and accurately predicts these semantic labels. Our model achieves higher accuracy than state-of-the-art segmentation models designed for realistic humans and cartoon figures, even after those models are fine-tuned. We demonstrate the value of our model through multiple applications: a fully automatic facial animation pipeline, a figure relighting pipeline, improvements to an existing animation method for childlike human figure drawings, and generalization to out-of-domain figures. Finally, to support future work in this area, we introduce a dataset of 16,000 childlike drawings with pixel-level annotations across 25 semantic categories. Our work can enable entirely new, easily accessible tools for hand-drawn character animation, and our dataset can open new lines of inquiry in a variety of graphics and human-centric research fields.

Paper Structure

This paper contains 27 sections, 2 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Representative images and manually created annotations for the 25 classes in our ChildlikeSHAPES dataset.
  • Figure 2: Proposed pipeline for Hierarchical Semantic Segmentation. The input image is passed to the coarse segmentation network, which predicts a coarse semantic mask. The coarse mask is then fed to the fine network via the prompt encoder to predict part-level semantic segmentation. Finally, the face region from the fine mask is cropped and passed to the face-segmentation network to predict detailed semantics in the face region.
  • Figure 3: Generating Novel Facial Expressions: Proposed pipeline for preset-guided generation of novel eye and mouth shapes. Predefined, semantically labeled presets are deformed via salient points, and semantics-guided image synthesis then produces new details in the region of the deformed presets. The newly generated details are masked out and composited over an inpainted image (with the original eyes and mouth removed) to produce new facial expressions.
  • Figure 4: Proposed CharShadeNet architecture, which fine-tunes the image encoder of a pretrained CharSegNet to predict shading maps conditioned on a 3D light position. The predicted shading maps are then blended with the input image to produce the shaded figure.
  • Figure 5: Relationship between training dataset size and performance metrics: pixel accuracy, mean accuracy, and mean intersection-over-union (IoU).
  • ...and 11 more figures
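
The three metrics named in the Figure 5 caption (pixel accuracy, mean accuracy, and mean IoU) are commonly computed from a per-class confusion matrix. The following is a minimal sketch under the standard definitions; the function names, the choice to skip classes absent from the ground truth, and the toy inputs are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Count pixels; rows are ground-truth classes, columns are predicted classes."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Encode each (target, pred) pair as a single index, then histogram.
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(pred, target, num_classes):
    """Return pixel accuracy, mean per-class accuracy, and mean IoU."""
    cm = confusion_matrix(pred, target, num_classes).astype(np.float64)
    tp = np.diag(cm)                  # correctly labeled pixels per class
    gt = cm.sum(axis=1)               # ground-truth pixels per class
    predicted = cm.sum(axis=0)        # predicted pixels per class
    union = gt + predicted - tp
    present = gt > 0                  # assumption: ignore classes absent from ground truth
    return {
        "pixel_accuracy": tp.sum() / cm.sum(),
        "mean_accuracy": float(np.mean(tp[present] / gt[present])),
        "mean_iou": float(np.mean(tp[present] / union[present])),
    }

# Toy 2x2 label maps with 3 possible classes (class 2 never appears).
target = [[0, 0], [1, 1]]
pred = [[0, 1], [1, 1]]
metrics = segmentation_metrics(pred, target, num_classes=3)
```

Averaging only over classes that appear in the ground truth avoids division by zero when a part, such as a rarely drawn accessory, is missing from an evaluation set; other conventions exist, and reported numbers depend on which one is used.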