LEAST: "Local" text-conditioned image style transfer

Silky Singh, Surgan Jandial, Simra Shahid, Abhinav Java

TL;DR

This paper addresses the lack of region-aware text-conditioned style transfer by introducing an end-to-end local stylization pipeline. It grounds the target region and style description with LLaVA and SAM and then performs region-constrained stylization through masked CLIP losses during inference, without additional training. The approach supports multi-region transfers and demonstrates competitive CLIP alignment while delivering superior localization and content preservation relative to baselines, as shown by human preference results. A 25-image dataset with 10 region-style prompts per image is used to illustrate practical applicability and robustness of the method.

Abstract

Text-conditioned style transfer enables users to communicate their desired artistic styles through text descriptions, offering a new and expressive means of achieving stylization. In this work, we evaluate the text-conditioned image editing and style transfer techniques on their fine-grained understanding of user prompts for precise "local" style transfer. We find that current methods fail to accomplish localized style transfers effectively, either failing to localize style transfer to certain regions in the image, or distorting the content and structure of the input image. To this end, we develop an end-to-end pipeline for "local" style transfer tailored to align with users' intent. Further, we substantiate the effectiveness of our approach through quantitative and qualitative analysis. The project code is available at: https://github.com/silky1708/local-style-transfer.

Paper Structure (7 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of our approach. a) Text grounding in the content image: We first use LLaVA and SAM to obtain a precise segmentation mask of the specified region in the image, along with the corresponding desired style. b) Local style transfer: We use the region-style correspondences to constrain style transfer to the specified local region. c) This process can be repeated several times to achieve multi-region style transfer.
  • Figure 2: Qualitative comparison of our proposed method with text-conditioned image editing and style transfer approaches. All the baselines fail to localize the desired style transfers, and some also fail to preserve the content of the input image (see Instruct-pix2pix, 5th row). Best viewed zoomed in and in color. More results in Fig. \ref{fig:A1_qual1} and \ref{fig:A1_qual2}.
  • Figure 3: Snapshot of our dataset. We collect a set of 25 natural images to evaluate the efficacy of various approaches on the task of local style transfer. The copyrights exist with respective owners of these images.
  • Figure 4: Qualitative comparison of our proposed method with text-conditioned image editing and style transfer approaches.
  • Figure 5: Qualitative comparison of our proposed method with text-conditioned image editing and style transfer approaches.
  • ...and 2 more figures
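
The region-style correspondences described for Figure 1 can be illustrated with a small sketch. Note that the paper constrains stylization through masked CLIP losses during inference; the code below does not reproduce that. It substitutes a simpler stand-in, post-hoc mask compositing, only to show how a per-region mask keeps a style inside its region and how repeating the step yields a multi-region result. The function name `composite_regions` and the toy arrays are hypothetical and do not come from the project code.

```python
import numpy as np


def composite_regions(content, edits):
    """Blend each stylized image into `content` only where its mask is set.

    Illustrative stand-in (assumption): plain mask compositing, not the
    masked CLIP-loss optimization described in the paper.

    content: float array of shape (H, W, 3)
    edits:   list of (mask, stylized) pairs, where mask has shape (H, W)
             with values in [0, 1] and stylized has shape (H, W, 3)
    """
    out = content.astype(np.float64).copy()
    for mask, stylized in edits:
        m = mask.astype(np.float64)[..., None]  # add a channel axis for broadcasting
        out = m * stylized + (1.0 - m) * out    # keep pixels outside the mask unchanged
    return out


# Toy 4x4 "image": two disjoint regions, each paired with its own style.
content = np.zeros((4, 4, 3))

left_mask = np.zeros((4, 4))
left_mask[:, :2] = 1.0          # left half of the image

top_right_mask = np.zeros((4, 4))
top_right_mask[:2, 2:] = 1.0    # top-right quadrant

style_a = np.full((4, 4, 3), 0.5)  # placeholder for a stylized output
style_b = np.full((4, 4, 3), 1.0)  # placeholder for a second stylized output

result = composite_regions(
    content, [(left_mask, style_a), (top_right_mask, style_b)]
)
```

In this toy setup the bottom-right quadrant is covered by neither mask, so it stays identical to the content image, which is the localization property the figure captions emphasize.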