HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models

Hanzhang Wang, Haoran Wang, Jinze Yang, Zhongrui Yu, Zeke Xie, Lei Tian, Xinyan Xiao, Junjun Jiang, Xianming Liu, Mingming Sun

TL;DR

HiCAST addresses the demand in Arbitrary Style Transfer for explicit, multi-level customization rather than a single strength control. Building on Latent Diffusion Models, the method introduces a Style Adapter that injects content and style cues from multiple semantic levels and gives users flexible control over stylization. The approach extends to video by adding temporal layers and a Harmonious Consistency Loss that maintains cross-frame coherence without sacrificing stylization strength. Across image and video benchmarks, HiCAST demonstrates superior subjective quality and competitive objective metrics while enabling precise, signal-driven stylization control.

Abstract

The goal of Arbitrary Style Transfer (AST) is to inject the artistic features of a style reference into a given image or video. Existing methods usually focus on balancing style and content while ignoring the significant demand for flexible, customized stylization results, which limits their practical application. To address this issue, we propose HiCAST, a novel AST approach capable of explicitly customizing the stylization results according to various sources of semantic clues. Specifically, our model is built on the Latent Diffusion Model (LDM) and is designed to take content and style instances as conditions of the LDM. It introduces a Style Adapter, which allows users to flexibly manipulate the output by aligning multi-level style information with the intrinsic knowledge in the LDM. Finally, we extend our model to video AST. A novel learning objective is used for video diffusion model training, which significantly improves cross-frame temporal consistency while maintaining stylization strength. Qualitative and quantitative comparisons, as well as comprehensive user studies, demonstrate that HiCAST outperforms existing SoTA methods in generating visually plausible stylization results.
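
To make the idea of weighted, multi-signal control more concrete, the following is a minimal sketch of how several control features might be blended into a base feature with user-chosen weights. It is not the paper's Style Adapter: the additive injection, the helper name `inject_controls`, and the toy three-element feature vectors are all assumptions made for illustration.

```python
from typing import Dict, List

Feature = List[float]  # a flattened feature map, kept tiny for illustration


def inject_controls(
    base: Feature,
    controls: Dict[str, Feature],
    weights: Dict[str, float],
) -> Feature:
    """Add each control feature to the base feature, scaled by its weight.

    Hypothetical helper: additive, per-signal weighting is an assumption,
    not a mechanism confirmed by the paper.
    """
    out = list(base)
    for name, feat in controls.items():
        if len(feat) != len(base):
            raise ValueError(f"shape mismatch for control {name!r}")
        w = weights.get(name, 0.0)  # signals without a weight are ignored
        out = [o + w * f for o, f in zip(out, feat)]
    return out


# Toy stand-ins for a denoiser feature and two control signals.
base = [1.0, 2.0, 3.0]
controls = {"edge": [1.0, 0.0, -1.0], "depth": [0.5, 0.5, 0.5]}

no_control = inject_controls(base, controls, {"edge": 0.0, "depth": 0.0})
edge_only = inject_controls(base, controls, {"edge": 1.0})
mixed = inject_controls(base, controls, {"edge": 0.5, "depth": 2.0})
```

In this sketch, a weight of zero leaves the base feature unchanged, and larger weights let the corresponding signal dominate. That is the kind of behavior a per-signal weight is meant to expose to the user.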

Paper Structure (18 sections, 2 equations, 12 figures, 2 tables)

This paper contains 18 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: In contrast to existing AST methods (a-d), our proposed HiCAST model can customize the stylization results according to different control signals (e.g., edge (f), depth (g), semantic segmentation (h)). [Best viewed zoomed in]
  • Figure 2: Results of image AST methods.
  • Figure 3: Comparison of short-term temporal consistency across video AST methods. The odd rows show the previous frame. The even rows show the temporal error heatmap.
  • Figure 4: Image AST results with different control maps.
  • Figure 5: Image AST results using HED controllable maps with different weights.
  • ...and 7 more figures
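
The temporal error heatmap mentioned in the Figure 3 caption can be illustrated with a simplified sketch: a per-pixel absolute difference between two consecutive stylized frames. This is an assumption-laden toy version. Temporal consistency evaluations commonly warp the previous frame with optical flow before differencing, which is omitted here, and the function names and the 2x2 grayscale frames are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of intensities in [0, 1]


def temporal_error_heatmap(prev: Frame, curr: Frame) -> Frame:
    """Per-pixel absolute difference between two consecutive frames.

    Simplification (assumption): no optical-flow warping is applied, so
    this is only meaningful when the scene and camera are static.
    """
    if len(prev) != len(curr) or any(
        len(p) != len(c) for p, c in zip(prev, curr)
    ):
        raise ValueError("frames must have the same shape")
    return [
        [abs(c - p) for p, c in zip(prow, crow)]
        for prow, crow in zip(prev, curr)
    ]


def mean_temporal_error(heatmap: Frame) -> float:
    """Average the heatmap into one scalar; lower means more consistent."""
    pixels = [v for row in heatmap for v in row]
    return sum(pixels) / len(pixels)


# Toy 2x2 frames standing in for consecutive stylized outputs.
prev_frame = [[0.0, 0.5], [1.0, 0.25]]
curr_frame = [[0.0, 0.75], [0.5, 0.25]]

heatmap = temporal_error_heatmap(prev_frame, curr_frame)
score = mean_temporal_error(heatmap)
```

Bright regions in such a heatmap mark pixels that changed between frames. With a static scene, these changes correspond to flicker introduced by the stylization rather than real motion.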