ObjMST: An Object-Focused Multimodal Style Transfer Framework

Chanda Grover Kamra, Indra Deep Mastan, Debayan Gupta

TL;DR

ObjMST tackles multimodal image style transfer (IST) by decoupling style supervision for salient objects and their surroundings. It introduces masked directional CLIP losses ($L_{fg}$ and $L_{bg}$) to generate aligned style representations via cross-modal GAN inversion, and a Salient-To-Key (S2K) attention mechanism that maps content features to stable style keys for foreground stylization, followed by image harmonization to blend the stylized foreground with the background. The approach yields distinct foreground and background style representations ($S_{fg}$, $S_{bg}$) and improves semantic preservation, alignment, and perceptual quality, outperforming state-of-the-art baselines on text-based and multimodal IST tasks; results are supported by CLIPScore, LPIPS, CONTRIQUE, and NIMA metrics, as well as user studies. The work provides practical, controllable stylization with robust alignment between textual and visual cues, and suggests future extension to multi-object scenes with diverse, separate style representations.

Abstract

We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
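
As a rough illustration of the idea behind a masked directional CLIP loss, the sketch below computes the generic directional loss, $1 - \cos(\Delta I, \Delta T)$, separately on the masked foreground and background. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss formula, so the function names, the use of a binary mask multiplied into the image, and the `encode_image` callable (a stand-in for a CLIP image encoder) are all hypothetical.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def directional_loss(img_src_emb, img_out_emb, txt_src_emb, txt_tgt_emb):
    """Generic directional loss: 1 - cos(image shift, text shift)."""
    d_img = img_out_emb - img_src_emb   # how the image moved in embedding space
    d_txt = txt_tgt_emb - txt_src_emb   # how the text prompt moved
    return 1.0 - cosine(d_img, d_txt)

def masked_directional_losses(encode_image, content, output, mask,
                              txt_src, txt_fg, txt_bg):
    """Hypothetical masked variant: one loss per region.

    encode_image: assumed callable mapping an image array to a 1-D embedding.
    mask: array of 0/1 values, broadcastable to the image shape (1 = salient object).
    txt_*: precomputed text embeddings (source, foreground style, background style).
    """
    inv = 1.0 - mask
    l_fg = directional_loss(encode_image(content * mask),
                            encode_image(output * mask), txt_src, txt_fg)
    l_bg = directional_loss(encode_image(content * inv),
                            encode_image(output * inv), txt_src, txt_bg)
    return l_fg, l_bg
```

In practice `encode_image` and the text embeddings would come from a pretrained CLIP model; any encoder that returns fixed-length vectors can be plugged in to experiment with the structure of the loss.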

Paper Structure

This paper contains 13 sections, 9 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Comparative stylized outputs: (i) top row, left: MMIST (Single); (ii) top row, right: TIST (Double); and (iii) bottom row: TIST (Single). Columns (a, e, j) show the content images, and columns (b, f, i) show the multimodal style inputs. In MMIST [Wang2024WACV] (column c), misalignment is evident: the texture and color of the copper-plate features are inconsistent, in contrast to ObjMST (ours, column d). In TIST (Double), SemCS [kamra2023sem] (column g) introduces undesired distortions, whereas ObjMST (column h) correctly applies the "Starry Night" style features to the background (sky) and ice features to the foreground. In TIST (Single), ObjMST (column o) effectively preserves the content features while accurately applying the desired style.
  • Figure 2: The proposed ObjMST framework. Given the segmentation mask ($M_{\mathcal{C}}$) of the content image ($I_{\mathcal{C}}$), we compute the masked content image ($I_M$). In Step 1, we compute the optimal foreground and background latent vectors $w_{fg}^*$ and $w_{bg}^*$ to obtain the salient and surrounding style representations $S_{fg}$ and $S_{bg}$. This is achieved by passing the foreground style text-image pair $(T_{S_{fg}}, I_{S_{fg}})$ and the background style text-image pair $(T_{S_{bg}}, I_{S_{bg}})$ to cross-modal GAN inversion, which is trained using the proposed masked directional Style CLIP Loss ($L_{fg}$). In Step 2, the foreground stylized output ($I_{CS}^{'}$) is generated by mapping salient content features ($F_C^l$) to stable style key features ($F_{S_{fg}}^l$) through the $M_{S2K}$ mapper. Finally, the surrounding-style representation ($S_{bg}$) is applied to the background through image harmonization to generate the stylized output $I_{CS}$.
  • Figure 3: Text-based IST (Single). ObjMST (g) better preserves the facial structure and harmoniously integrates text and visual style cues, particularly in terms of texture and color consistency, compared to the baseline methods (c-f).
  • Figure 4: Content Mismatch. This figure illustrates content mismatch issues in style transfer methods, such as fire appearing on the car (first row) and graffiti style features on the building (second row). It can be observed that ObjMST (Ours) effectively minimizes these mismatches, producing more coherent stylized outputs.
  • Figure 4: NIMA Score
  • ...and 12 more figures
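
The masking step described in the Figure 2 caption can be sketched as follows. This is only an illustrative sketch: it assumes that the masked content image $I_M$ is the element-wise product of a binary mask and the content image, and it uses a plain mask-weighted blend as a stand-in for the final step. The paper uses image harmonization there, which the caption does not detail, so `naive_composite` is a hypothetical placeholder and not the authors' method.

```python
import numpy as np

def masked_content(image, mask):
    """Assumed form of I_M: keep only the salient-object pixels.

    image: H x W x C float array; mask: H x W array of 0/1 values.
    """
    return image * mask[..., None]   # broadcast the mask over channels

def naive_composite(fg_stylized, bg_stylized, mask):
    """Placeholder for harmonization: mask-weighted blend of two stylized images."""
    m = mask[..., None]
    return m * fg_stylized + (1.0 - m) * bg_stylized
```

A soft (non-binary) mask would give a smoother seam in this blend, but a learned harmonization step also adjusts color and lighting, which a simple blend cannot do.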