
Learning to Incorporate Texture Saliency Adaptive Attention to Image Cartoonization

Xiang Gao, Yuqi Zhang, Yingjie Tian

TL;DR

This work tackles image cartoonization by arguing that cartoon textures are concentrated in local regions with distinct edges, and that standard adversarial training on full images does not transfer these features sufficiently. It introduces a dual-adversarial framework with an image-level discriminator and a patch-level discriminator. The patch-level branch is guided by a cartoon-texture-saliency-sampler (CTSS) that adaptively selects the top-$K$ texture-salient patches from each mini-batch, using guided-filter edge maps and a Sobel-like kernel to focus learning on texture-rich regions. The resulting model is compact and end-to-end: it omits edge-smoothing data preparation and additional style losses, yet achieves more abstract and vivid cartoonization, particularly for high-resolution inputs, as evidenced by lower FID scores and robust qualitative results. Overall, the CTSS-driven, texture-focused attention improves cartoon texture transfer and visual appeal, with practical value for automatic cartoonization in real-world image processing pipelines.

Abstract

Image cartoonization has recently been dominated by generative adversarial networks (GANs) from the perspective of unsupervised image-to-image translation, in which an inherent challenge is to precisely capture and sufficiently transfer characteristic cartoon styles (e.g., clear edges, smooth color shading, and abstract fine structures). Existing advanced models try to enhance the cartoonization effect by learning to promote edges adversarially, introducing a style transfer loss, or learning to align style across multiple representation spaces. This paper demonstrates that a more distinct and vivid cartoonization effect can be achieved easily with only a basic adversarial loss. Observing that cartoon style is more evident in cartoon-texture-salient local image regions, we build a region-level adversarial learning branch in parallel with the normal image-level one, which constrains adversarial learning to cartoon-texture-salient local patches so that cartoon texture features are better perceived and transferred. To this end, a novel cartoon-texture-saliency-sampler (CTSS) module is proposed to dynamically sample cartoon-texture-salient patches from the training data. With extensive experiments, we demonstrate that texture saliency adaptive attention in adversarial learning, a missing ingredient of related image cartoonization methods, is of significant importance in facilitating and enhancing image cartoon stylization, especially for high-resolution input pictures.
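
The patch sampling idea described in the TL;DR and abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CTSS implementation: it assumes grayscale inputs, swaps in a plain Sobel gradient magnitude for the guided-filter edge maps, and scores patches on a fixed grid. The function names `sobel_edge_map` and `sample_salient_patches` and the parameters `patch_size`, `stride`, and `k` are hypothetical, chosen only for illustration.

```python
import numpy as np


def sobel_edge_map(gray):
    """Sobel gradient magnitude of a 2-D float array (edge-padded).

    Assumption: a plain Sobel filter stands in for the guided-filter
    edge maps mentioned in the text.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 cross-correlation one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)


def sample_salient_patches(batch, patch_size=8, stride=8, k=4):
    """Pick the k patches with the largest summed edge response in a batch.

    batch: array of shape (N, H, W), grayscale (an assumption of this sketch).
    Returns (patches, locations): patches has shape (k, patch_size, patch_size),
    locations is a list of (image_index, top, left), most salient first.
    """
    scores, locations = [], []
    for n, img in enumerate(batch):
        edges = sobel_edge_map(img)
        for top in range(0, img.shape[0] - patch_size + 1, stride):
            for left in range(0, img.shape[1] - patch_size + 1, stride):
                patch_edges = edges[top:top + patch_size, left:left + patch_size]
                scores.append(patch_edges.sum())
                locations.append((n, top, left))
    # Rank all candidate patches across the whole mini-batch, not per image.
    order = np.argsort(scores)[::-1][:k]
    chosen = [locations[i] for i in order]
    patches = np.stack([
        batch[n][top:top + patch_size, left:left + patch_size]
        for n, top, left in chosen
    ])
    return patches, chosen
```

In a dual-discriminator setup like the one summarized above, patches chosen this way would feed the patch-level discriminator while full images feed the image-level one. Ranking across the whole mini-batch, rather than per image, means that images with little texture may contribute no patches at all.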

Paper Structure (16 sections, 17 equations, 22 figures, 4 tables)

This paper contains 16 sections, 17 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: Example evaluation results of our method in transforming real-world scenes into cartoon styles (better to zoom in).
  • Figure 2: The overall architecture of our model, with details of the proposed cartoon-texture-saliency-sampler (CTSS) module, which adaptively extracts the local image patches with the most salient cartoon texture patterns from each mini-batch of input images.
  • Figure 3: The typical cartoon texture pattern manifests clearly only in partial image regions with distinct edges.
  • Figure 4: Visualization of the refined edge maps $\tilde{E}$ produced during the forward pass of our CTSS module.
  • Figure 5: Example image cartoonization results on high-resolution real-world input images. Results come from our model trained on different cartoon datasets, including "The Wind Rises" (the second row), "Dragon Ball" (the third row), and "Crayon Shin-chan" (the bottom row). Please zoom in for better resolution.
  • ...and 17 more figures