Learning to Manipulate Artistic Images

Wei Guo, Yuqi Zhang, De Ma, Qian Zheng

TL;DR

This work tackles zero-shot manipulation of artistic images without relying on semantic inputs, addressing the cross-domain artifacts and imprecise local details of prior exemplar-based methods. It introduces SIM-Net, a dual-branch framework. The Mask-Based Correspondence Network operates on semantic-free masks $y_A$ and exemplar guidance $y_B$, and uses a Dilating Module to produce full-resolution warp fields $\omega^k$ that guide region-wise transformation. The Translation Network then merges the warped regions through a region transportation strategy and refines the result using a Texture-Guidance Module, which forms a pseudo ground truth with a single warp $\omega_{\mathcal{S}}$. Training is self-supervised, with losses including $\mathcal{L}_{bound}$, $\mathcal{L}_{context}$, and $\mathcal{L}_{cyc}$ combined in the total loss $\mathcal{L}_{total}$. Experiments on 237 artistic images spanning 10 styles show that SIM-Net achieves competitive style fidelity and high-quality, artifact-free manipulations with efficient computation, which makes it practical for flexible, high-resolution artistic image editing without extensive style-specific training.
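
To make the idea of merging several warped candidates concrete, here is a minimal NumPy sketch under stated assumptions. It is not the paper's Image Transport Module, which is learned. The function names (`warp_image`, `keypoint_weights`, `transport_regions`), the nearest-neighbor sampling, and the Gaussian distance weighting with `sigma` are all hypothetical choices made for illustration. The sketch only shows the general pattern: warp the exemplar once per warp field, then let each pixel favor the candidate whose keypoint is closest.

```python
import numpy as np


def warp_image(image, flow):
    """Backward-warp `image` (H, W, C) with a dense field `flow` (H, W, 2).

    flow[y, x] = (dy, dx) means output pixel (y, x) is read from input
    pixel (y + dy, x + dx). Nearest-neighbor sampling, clamped at borders.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_y, src_x]


def keypoint_weights(keypoints, shape, sigma=2.0):
    """Soft weights (K, H, W) that sum to 1 over K at every pixel.

    Assumption for illustration: a Gaussian falloff around each keypoint,
    so a warp field dominates near its own keypoint.
    """
    h, w = shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    kp = np.asarray(keypoints, dtype=float)  # (K, 2), stored as (y, x)
    d2 = (ys[None] - kp[:, 0, None, None]) ** 2 \
        + (xs[None] - kp[:, 1, None, None]) ** 2
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def transport_regions(exemplar, flows, keypoints, sigma=2.0):
    """Blend K warped copies of `exemplar` using keypoint-based weights."""
    candidates = np.stack([warp_image(exemplar, f) for f in flows])   # (K, H, W, C)
    weights = keypoint_weights(keypoints, exemplar.shape[:2], sigma)  # (K, H, W)
    return (weights[..., None] * candidates).sum(axis=0)


# Usage: two warp fields on a tiny 4x4 single-channel image.
exemplar = np.arange(16, dtype=float).reshape(4, 4, 1)
identity = np.zeros((4, 4, 2))
shift_right = np.zeros((4, 4, 2))
shift_right[..., 1] = 1.0  # sample from one column to the right
merged = transport_regions(exemplar, [identity, shift_right], [(0, 0), (3, 3)])
```

In this toy setup, pixels near the keypoint `(0, 0)` stay close to the unwarped exemplar, while pixels near `(3, 3)` follow the shifted candidate. A hard, per-pixel choice between candidates would produce visible seams, which is the kind of splicing artifact the Texture-Guidance Module is described as addressing.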

Abstract

Recent advances in computer vision have significantly lowered the barriers to artistic creation. Exemplar-based image translation methods have attracted much attention due to their flexibility and controllability. However, these methods make assumptions about semantics or require semantic information as input, while accurate semantics are not easy to obtain for artistic images. In addition, these methods suffer from cross-domain artifacts due to training data priors and generate imprecise structure due to feature compression in the spatial domain. In this paper, we propose an arbitrary Style Image Manipulation Network (SIM-Net), which leverages semantic-free information as guidance and a region transportation strategy in a self-supervised manner for image generation. Our method balances computational efficiency and high resolution to a certain extent. Moreover, our method facilitates zero-shot style image manipulation. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art methods. Code is available at https://github.com/SnailForce/SIM-Net.

Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: (a) The framework of state-of-the-art exemplar-based image translation methods, such as CoCosNet v2 [zhou2021cocosnet], MCL-Net [zhan2022marginal], and MATEBIT [jiang2023masked]. (b) These methods require accurate semantic conditional input, while accurate semantic information of artistic images is difficult to extract. (c) The spatial compression in the cross-domain alignment phase leads to imprecise local details. (d) The conditional generation phase might introduce cross-domain artifacts.
  • Figure 2: The overall architecture of SIM-Net. The SAM module extracts the semantic-free mask of the exemplar, denoted as $y_A$, which users then edit to obtain the conditional mask, denoted as $x_A$. First, the Local Region Alignment Module generates a small number of keypoints that adaptively govern the modified regions. Subsequently, the Dilating Module establishes multiple full-resolution warp fields, each corresponding to a keypoint, for global control. Notably, each warp field exhibits better control over the region near its corresponding keypoint. Finally, to exploit this characteristic of the warp fields, we propose the region transportation strategy, implemented by the Image Transport Module, which uses the multiple warp fields to construct the generated image, denoted as $\hat{x}_B$. However, $\hat{x}_B$ exhibits splicing artifacts marked by spatial inconsistency. We therefore design the Texture-Guidance Module to construct the pseudo ground truth, denoted as $x_B$, which serves as a self-supervised signal to eliminate splicing artifacts and ensure spatial consistency.
  • Figure 3: Intermediate results for a training sample in an early epoch. $\hat{x}_B^{\mathcal{P}}$ exhibits better geometric and spatial consistency at this stage, thanks to the geometric consensus achieved through the warp fields, but its layout is less consistent with $x_A$. In contrast, $\hat{x}_B$ shows improved semantic consistency as a result of fusing several candidates, but this fusion can introduce fusion and splicing artifacts that may disrupt geometric and spatial consistency. Together, these results give visual evidence of the trade-off between geometric, spatial, and semantic consistency in the intermediate results during early training epochs.
  • Figure 4: Qualitative comparison of our method with state-of-the-art methods on high-resolution images.
  • Figure 5: Qualitative comparison with state-of-the-art methods. Our method produces no cross-domain artifacts and preserves fine details without blurring.
  • ...and 2 more figures