DSI2I: Dense Style for Unpaired Image-to-Image Translation

Baran Ozaydin, Tong Zhang, Sabine Süsstrunk, Mathieu Salzmann

TL;DR

This work introduces DSI2I, a framework for exemplar-based unpaired image-to-image translation that represents style as a dense, spatial feature map rather than a single global vector. By disentangling dense style from content with adversarial and perceptual losses and warping the exemplar style via cross-domain semantic correspondences (CLIP-based) and optimal transport, the method achieves finer-grained, region-aware style transfer without semantic labels. It also integrates a Dense Normalization pipeline to inject dense style and proposes a classwise stylistic distance metric to quantify alignment with exemplars. Results on synthetic-to-real and real-to-real translations show improved stylistic accuracy, content preservation, and domain fidelity compared to prior methods, with ablations validating the contribution of dense styling and OT-based correspondences. The approach broadens applicability of exemplar-guided I2I to unlabeled, multi-object scenes and offers richer control over localized stylistic changes.
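The contrast between a single global style vector and a dense style map can be made concrete with a small NumPy sketch. This is not the authors' Dense Normalization implementation. It assumes an AdaIN-like scheme in which content features are normalized per channel and then scaled and shifted, and the function names (`normalize`, `modulate_global`, `modulate_dense`) and tensor shapes are hypothetical.

```python
import numpy as np

def normalize(content, eps=1e-5):
    # content: (C, H, W). Remove the per-channel mean and scale (instance norm).
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    return (content - mean) / (std + eps)

def modulate_global(content, gamma, beta):
    # gamma, beta: (C,). One scale/shift per channel, shared by every pixel.
    return normalize(content) * gamma[:, None, None] + beta[:, None, None]

def modulate_dense(content, gamma_map, beta_map):
    # gamma_map, beta_map: (C, H, W). One scale/shift per channel AND per pixel,
    # so different regions can receive different styles.
    return normalize(content) * gamma_map + beta_map

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8, 8))  # toy content features (assumed shape)

# Global style: identical modulation everywhere.
gamma = np.ones(4)
beta = np.zeros(4)
out_global = modulate_global(content, gamma, beta)

# Dense style: the left and right halves of the image are shifted differently.
gamma_map = np.ones((4, 8, 8))
beta_map = np.zeros((4, 8, 8))
beta_map[:, :, :4] = -1.0
beta_map[:, :, 4:] = 1.0
out_dense = modulate_dense(content, gamma_map, beta_map)
```

In this toy setup the dense output differs from the global one by exactly the per-pixel shift `beta_map`, which is the kind of region-aware control a single vector cannot express.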

Abstract

Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar, without ground-truth input-translation pairs. Existing UEI2I methods represent style using one vector per image or rely on semantic supervision to define one style vector per object. Here, in contrast, we propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information. We then rely on perceptual and adversarial losses to disentangle our dense style and content representations. To stylize the source content with the exemplar style, we extract unsupervised cross-domain semantic correspondences and warp the exemplar style to the source content. We demonstrate the effectiveness of our method on four datasets using standard metrics together with a localized style metric we propose, which measures style similarity in a class-wise manner. Our results show that the translations produced by our approach are more diverse, preserve the source content better, and are closer to the exemplars when compared to the state-of-the-art methods. Project page: https://github.com/IVRL/dsi2i
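The warping step described in the abstract can be sketched as a soft attention over exemplar locations. The snippet below is an illustration under stated assumptions, not the released code: it assumes per-pixel semantic features are already available for both images (for example from a pretrained vision backbone), uses a temperature softmax over cosine similarities as the correspondence, and the names `soft_correspondence`, `warp_style`, and `temperature` are hypothetical.

```python
import numpy as np

def soft_correspondence(src_feat, ex_feat, temperature=0.05):
    # src_feat: (N_src, D), ex_feat: (N_ex, D) semantic features per pixel.
    src = src_feat / np.linalg.norm(src_feat, axis=1, keepdims=True)
    ex = ex_feat / np.linalg.norm(ex_feat, axis=1, keepdims=True)
    sim = src @ ex.T                      # cosine similarity, (N_src, N_ex)
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1

def warp_style(ex_style, corr):
    # ex_style: (N_ex, C) dense style of the exemplar, flattened over pixels.
    # Returns (N_src, C): for each source pixel, a weighted mix of exemplar styles.
    return corr @ ex_style

# Toy example with three pixels per image and 2-D semantic features.
src_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ex_feat = np.array([[0.0, 2.0], [3.0, 3.0], [5.0, 0.0]])
ex_style = np.eye(3)                      # one distinct "style" per exemplar pixel

corr = soft_correspondence(src_feat, ex_feat)
warped = warp_style(ex_style, corr)
matches = corr.argmax(axis=1)             # best exemplar pixel per source pixel
```

Each source pixel ends up with the style of the exemplar pixel whose semantic feature points in the most similar direction, so the style map is rearranged to follow the source layout.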

Paper Structure (29 sections, 16 equations, 22 figures, 11 tables)

Figures (22)

  • Figure 1: Global style vs dense style representations. The baseline method, MUNIT (Huang et al., 2018), represents the exemplar style with a single feature vector per image. As a result, some appearance information from the exemplar bleeds into semantically incorrect regions, giving, for example, an unnatural bluish tint to the road and the buildings in the second row, first image. By modeling style densely, our approach better respects the semantics when applying the style from the exemplar to the source content. Our method also has finer-grained control over style: the colors of the road and center line in the third row reflect the exemplar appearance more accurately.
  • Figure 2: Overview of the method. We represent style as a feature map with spatial dimensions and constrain it via adversarial and perceptual losses for disentanglement. Our method does not require any labels or paired images during training. At test time, we use the CLIP (Radford et al., 2021) vision backbone to build semantic correspondences and warp the style of the exemplar to the source content. See Section \ref{sec:method} for definitions and explanations.
  • Figure 3: Style components and correspondence matrices. Examples of the simulated and created correspondence matrices $\mathbf{A}_{xx}, \mathbf{A}_{xy} \in [0,1]^{3 \times 3}$. The top row shows the self-correspondence $\mathbf{A}_{xx}$ between three pixels of an image from the purple domain, whereas the bottom row shows the cross-domain correspondence $\mathbf{A}_{xy}$ between an image from the purple domain and an image from the yellow domain. Using a)-e) during training enables our model to generalize to f) at test time.
  • Figure 4: Effect of the exemplar. Our method can change the appearance of each semantic region differently while still producing realistic output. The colors of the road and car in our translations match the exemplar road and car styles better than those of the baseline, MUNIT (Huang et al., 2018). The content image is shown in Figure \ref{fig:model}.
  • Figure 5: Qualitative comparison with other methods. CS $\rightarrow$ GTA translations. In the first column, our method disentangles the road from the sky and preserves the dark color of the road. In the second column, the appearance of the road and road lines in our translation is closest to that in the exemplar. In the last two columns, our model preserves the semantics better, especially for the tree and building classes.
  • ...and 17 more figures