Style-Editor: Text-driven object-centric style editing

Jihun Park, Jongmin Gim, Kyoungmin Lee, Seunghun Lee, Sunghoon Im

TL;DR

This work tackles text-guided, object-centric image editing: it changes only the targeted object's appearance while preserving the background. It introduces Style-Editor, a CLIP-guided pipeline that localizes object regions via Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) and applies a Patch-wise Co-Directional (PCD) loss to align style changes with the textual input, complemented by an Adaptive Background Preservation (ABP) loss. Without relying on segmentation masks, the method stylizes the foreground faithfully and preserves the background naturally by combining patch-level directional alignment and distribution consistency (via $L_{pcd}$ and its consistency term $L_{con}$) with adaptive background masking. Experimental results on MSCOCO show state-of-the-art object-centric stylization with coherent CLIP alignment and competitive efficiency, highlighting the approach's practicality for editorial workflows and creative exploration.

Abstract

We present Style-Editor, a text-driven, object-centric style editing model that guides style editing at the object level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, designed for precise object-centric editing that is closely aligned with the input text. This loss combines a patch directional loss, which steers the style in the text-guided direction, with a patch distribution consistency loss, which keeps the CLIP embedding distribution even across object regions. Together they ensure seamless and harmonious style editing across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules, which identify object locations via text and eliminate the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss, applied to dynamically identified background areas, to maintain the original style and structure of the image's background. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style edits.
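The two parts of the PCD loss described above can be illustrated with a minimal NumPy sketch. The directional term follows the widely used CLIP directional form, 1 - cos(patch edit direction, text edit direction). The consistency term is only an assumed stand-in (spread of per-patch edit directions around their mean), because the abstract does not give the exact formula. All function names, the weights `w_dir` and `w_con`, and the use of plain arrays in place of real CLIP embeddings are hypothetical choices made for illustration.

```python
import numpy as np


def _unit(x, eps=1e-8):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def patch_directional_loss(e_src, e_out, t_src, t_tgt):
    """Mean over patches of 1 - cos(patch edit direction, text edit direction).

    e_src, e_out: (N, D) embeddings of source and stylized patches.
    t_src, t_tgt: (D,) embeddings of the source and target text.
    """
    d_img = _unit(e_out - e_src)  # (N, D) per-patch edit directions
    d_txt = _unit(t_tgt - t_src)  # (D,) text edit direction
    return float(np.mean(1.0 - d_img @ d_txt))


def patch_consistency_loss(e_src, e_out):
    """ASSUMPTION: stand-in consistency term, not the paper's formula.

    Penalizes the spread of per-patch edit directions around their mean,
    so that all selected patches move in a similar direction.
    """
    d_img = _unit(e_out - e_src)
    mean_dir = d_img.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum((d_img - mean_dir) ** 2, axis=-1)))


def pcd_loss(e_src, e_out, t_src, t_tgt, w_dir=1.0, w_con=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (w_dir * patch_directional_loss(e_src, e_out, t_src, t_tgt)
            + w_con * patch_consistency_loss(e_src, e_out))
```

In this sketch, if every patch embedding shifts by exactly the text difference `t_tgt - t_src`, both terms are close to zero; patches that move against the text direction push the directional term toward 2.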

Paper Structure (31 sections, 7 equations, 19 figures, 7 tables, 2 algorithms)

Figures (19)

  • Figure 1: Results of our Style-Editor under diverse textual conditions. The text to the left and right of the arrow ($\rightarrow$) is the source text $T^{\text{src}}$ and the style text $T^{\text{sty}}$, respectively.
  • Figure 2: Our editing results in industrial applications.
  • Figure 3: (Left) Overall pipeline of our Style-Editor, consisting of a style editing network (StyleNet), a Pre-fixed Region Selection (PRS) module, a Text-Matched Patch Selection (TMPS) module, and pretrained CLIP encoders. The StyleNet takes a source image $I^{\text{src}}$ and generates an object-wise stylized image $I^{\text{out}}$. The TMPS module pinpoints the patches that most closely correspond to $T^{\text{src}}$ within the foreground regions identified by the PRS. The selected and augmented patches, $\mathbf{P}^{\text{src}}, \mathbf{P}^{\text{out}}$, are then aligned with $T^{\text{src}}, T^{\text{tgt}}$ in the CLIP embedding space using the Patch-wise Co-Directional (PCD) loss $\mathcal{L}_{\text{pcd}}$. The target text $T^{\text{tgt}}$ is derived by central word selection. Additionally, we apply a content loss $\mathcal{L}_{\text{c}}$ and an Adaptive Background Preservation (ABP) loss $\mathcal{L}_{\text{abp}}$ to enhance object-centric style editing, along with a total variation loss $\mathcal{L}_{\text{tv}}$ for regularization. (Right) Illustration of how the PCD loss $\mathcal{L}_{\text{pcd}}$ operates in feature space. It is composed of a patch-wise directional loss $\mathcal{L}_{\text{dir}}$ and a patch distribution consistency loss $\mathcal{L}_{\text{con}}$.
  • Figure 4: Comparison of our method with various text-guided style editing models. Results from our model and non-diffusion-based models appear to the left of the solid line; results from diffusion-based methods appear to the right.
  • Figure 5: Qualitative results showcasing the impact of applying/omitting the proposed losses. Configurations (a)-(e) correspond to the settings detailed in Tab. \ref{tab:ablation}.
  • ...and 14 more figures
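
The patch selection step described in the Figure 3 caption (TMPS choosing the patches that best match $T^{\text{src}}$ among the regions proposed by PRS) can be sketched as a masked top-k search over cosine similarities. This is a rough illustration only: the top-k rule, the parameter `k`, the boolean `candidate_mask` standing in for the PRS output, and the function name are all assumptions, and plain arrays stand in for CLIP embeddings.

```python
import numpy as np


def select_text_matched_patches(patch_emb, text_emb, candidate_mask, k):
    """Return indices of the k candidate patches most similar to the text.

    patch_emb: (N, D) patch embeddings (stand-ins for CLIP image features).
    text_emb: (D,) embedding of the source text.
    candidate_mask: (N,) booleans, True for patches kept by region selection
        (hypothetical stand-in for the PRS output).
    k: number of patches to keep (hypothetical parameter).
    """
    patch_emb = np.asarray(patch_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    candidate_mask = np.asarray(candidate_mask, dtype=bool)

    # Cosine similarity between every patch and the text embedding.
    p = patch_emb / (np.linalg.norm(patch_emb, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = p @ t

    # Exclude patches outside the candidate regions, then take the top k.
    sim = np.where(candidate_mask, sim, -np.inf)
    k = min(k, int(candidate_mask.sum()))
    order = np.argsort(-sim, kind="stable")
    return order[:k]


patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
mask = [True, True, True, False]  # the last patch is outside the candidates
selected = select_text_matched_patches(patches, [1.0, 0.0], mask, k=2)
print(selected.tolist())  # prints [0, 2]
```

The fourth patch matches the text perfectly but is never selected, because the mask restricts the search to the candidate regions first.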