ObjMST: An Object-Focused Multimodal Style Transfer Framework

Chanda Grover Kamra; Indra Deep Mastan; Debayan Gupta

TL;DR

ObjMST tackles multimodal image style transfer (IST) by decoupling style supervision for salient objects from that for their surroundings. It introduces masked directional CLIP losses ($L_{fg}$ for the foreground and $L_{bg}$ for the background) to generate aligned style representations via cross-modal GAN inversion, and a Salient-To-Key (S2K) attention mechanism that maps content features to stable style keys for foreground stylization; image harmonization then blends the stylized foreground with the background. The approach yields distinct foreground and background style representations ($S_{fg}$, $S_{bg}$) and improves semantic preservation, alignment, and perceptual quality, outperforming state-of-the-art baselines on text-based and multimodal IST tasks. Results are supported by CLIPScore, LPIPS, CONTRIQUE, and NIMA metrics and by user studies. The work offers controllable stylization with aligned textual and visual cues, and suggests extending the method to multi-object scenes with a separate style representation per object.

Abstract

We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: https://github.com/chandagrover/ObjMST.
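The abstract does not spell out the form of the Style-Specific Masked Directional CLIP Loss. As a rough illustration only, the sketch below assumes the commonly used directional CLIP loss (one minus the cosine similarity between the image-embedding shift and the text-embedding shift) and applies it separately to the masked foreground and the masked background. The function names, the `encode_image` callable, the toy linear encoder, and the random text embeddings are all hypothetical stand-ins, not the authors' implementation; a real setup would use a pretrained CLIP image and text encoder.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def directional_loss(src_img_emb, out_img_emb, src_txt_emb, tgt_txt_emb):
    """Assumed form: 1 - cos(image-space shift, text-space shift)."""
    delta_img = out_img_emb - src_img_emb
    delta_txt = tgt_txt_emb - src_txt_emb
    return 1.0 - cosine(delta_img, delta_txt)

def masked_directional_losses(content, stylized, mask, encode_image,
                              src_txt_emb, fg_txt_emb, bg_txt_emb):
    """Return (L_fg, L_bg) under the assumption that masking is applied
    to the images before encoding. `mask` is 1 on the salient object."""
    m = mask[..., None]  # (H, W, 1), broadcasts over colour channels
    l_fg = directional_loss(encode_image(content * m),
                            encode_image(stylized * m),
                            src_txt_emb, fg_txt_emb)
    l_bg = directional_loss(encode_image(content * (1.0 - m)),
                            encode_image(stylized * (1.0 - m)),
                            src_txt_emb, bg_txt_emb)
    return l_fg, l_bg

# Toy demo: a random linear map stands in for the CLIP image encoder.
rng = np.random.default_rng(0)
H, W, D = 8, 8, 16
proj = rng.normal(size=(H * W * 3, D))

def toy_encoder(img):
    """Hypothetical encoder: flatten the image and project to D dims."""
    return img.reshape(-1) @ proj

content = rng.random((H, W, 3))
stylized = rng.random((H, W, 3))
mask = np.zeros((H, W))
mask[2:6, 2:6] = 1.0  # square "salient object" region

# Random vectors stand in for text embeddings of the source prompt
# and the foreground / background style prompts.
src_txt, fg_txt, bg_txt = rng.normal(size=(3, D))

l_fg, l_bg = masked_directional_losses(content, stylized, mask, toy_encoder,
                                       src_txt, fg_txt, bg_txt)
print(l_fg, l_bg)
```

Both values lie in [0, 2]; a value near 0 means the stylized region moved in embedding space in the same direction as the text prompt. Keeping the two terms separate is what would allow different style prompts to supervise the object and its surroundings.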
