LEAST: "Local" text-conditioned image style transfer
Silky Singh, Surgan Jandial, Simra Shahid, Abhinav Java
TL;DR
This paper addresses the lack of region-aware text-conditioned style transfer by introducing an end-to-end local stylization pipeline. It grounds the target region and style description with LLaVA and SAM and then performs region-constrained stylization through masked CLIP losses during inference, without additional training. The approach supports multi-region transfers and demonstrates competitive CLIP alignment while delivering superior localization and content preservation relative to baselines, as shown by human preference results. A 25-image dataset with 10 region-style prompts per image is used to illustrate practical applicability and robustness of the method.
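To make "region-constrained" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary region mask is already available (for example, one produced by a segmentation model such as SAM) and omits the CLIP encoder entirely. The helper names `masked_view` and `composite_region` are hypothetical: the first shows the kind of masked image a region-restricted loss could be computed on, and the second shows how pixels outside the mask can be kept identical to the input.

```python
import numpy as np

def masked_view(image, mask):
    """Zero out everything outside the target region.

    image: H x W x 3 float array; mask: H x W array of 0/1 values.
    A region-restricted loss could be evaluated on this view (assumption).
    """
    return image * mask.astype(image.dtype)[..., None]

def composite_region(content, stylized, mask):
    """Take stylized pixels inside the mask and content pixels outside it."""
    m = mask.astype(content.dtype)[..., None]  # broadcast over channels
    return m * stylized + (1.0 - m) * content

# Toy example: a 2 x 2 image where only the diagonal pixels are in the region.
content = np.zeros((2, 2, 3), dtype=np.float32)
stylized = np.ones((2, 2, 3), dtype=np.float32)
mask = np.array([[1, 0], [0, 1]])

result = composite_region(content, stylized, mask)
region_only = masked_view(stylized, mask)
```

In this toy case, pixels where the mask is 0 come straight from `content`, which is the content-preservation property the summary above refers to.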
Abstract
Text-conditioned style transfer enables users to communicate their desired artistic styles through text descriptions, offering a new and expressive means of achieving stylization. In this work, we evaluate text-conditioned image editing and style transfer techniques on their fine-grained understanding of user prompts for precise "local" style transfer. We find that current methods fail to perform localized style transfer effectively: they either do not confine the stylization to the specified regions of the image or distort the content and structure of the input image. To address this, we develop an end-to-end pipeline for "local" style transfer tailored to align with users' intent. We further substantiate the effectiveness of our approach through quantitative and qualitative analysis. The project code is available at: https://github.com/silky1708/local-style-transfer.
