High-Resolution Daytime Translation Without Domain Labels

Ivan Anokhin, Pavel Solovev, Denis Korzhenkov, Alexey Kharlamov, Taras Khakhulin, Alexey Silvestrov, Sergey Nikolenko, Victor Lempitsky, Gleb Sterkin

TL;DR

HiDT tackles the problem of daytime translation for high-resolution landscape images without relying on domain labels. It introduces a content/style disentangled architecture with AdaIN-based generation and augmented skip connections, complemented by a postprocessing enhancement pipeline to produce high-resolution outputs. The method is trained on unaligned images with weak segmentation supervision and employs a suite of losses, including a CORAL-inspired style distribution loss, to learn robust style transfer. Experiments show HiDT is competitive with label-dependent baselines and generalizes to other domains, with practical applications such as timelapse generation from single images.
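As a rough illustration of the AdaIN-based generation mentioned above, the sketch below applies adaptive instance normalization to a small feature map: each channel is normalized to zero mean and unit variance, then rescaled and shifted by style-dependent parameters. The function name `adain`, the nested-list layout, and the idea of passing `scale` and `shift` directly are assumptions made for this example; in a model such as HiDT these parameters would presumably be derived from the style code, and this is not the authors' implementation.

```python
import math


def adain(feature_map, scale, shift, eps=1e-5):
    """Adaptive instance normalization on a list of channels.

    feature_map: list of channels, each a 2D list of floats (H x W).
    scale, shift: one float per channel (hypothetical style parameters).
    """
    out = []
    for channel, gamma, beta in zip(feature_map, scale, shift):
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)  # eps avoids division by zero
        # Normalize the channel, then impose the style statistics.
        out.append([[gamma * (v - mean) / std + beta for v in row]
                    for row in channel])
    return out


# Two channels of a 2 x 2 feature map with different content statistics.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 10.0], [20.0, 30.0]]]
stylized = adain(features, scale=[2.0, 0.5], shift=[-1.0, 3.0])
```

After the call, each output channel has a mean close to its `shift` value and a standard deviation close to its `scale` value, whatever the statistics of the input channel were. That is the sense in which the style parameters control the output appearance.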

Abstract

Modeling daytime changes in high-resolution photographs, e.g., re-rendering the same scene under different illuminations typical for day, night, or dawn, is a challenging image manipulation task. We present the high-resolution daytime translation (HiDT) model for this task. HiDT combines a generative image-to-image model and a new upsampling scheme that allows image translation to be applied at high resolution. The model demonstrates competitive results in terms of both commonly used GAN metrics and human evaluation. Importantly, this good performance comes as a result of training on a dataset of still landscape images with no daytime labels available. Our results are available at https://saic-mdal.github.io/HiDT/.

Paper Structure

This paper contains 10 sections, 1 equation, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Daytime translation results. Left -- original images, right -- translated and enhanced images (one style per column).
  • Figure 2: Diagram of the Adaptive U-Net architecture: an encoder-decoder network with dense skip-connections and content-style decomposition $(\mathbf{c}, \mathbf{s})$.
  • Figure 3: HiDT learning data flow. We show half of the (symmetric) architecture; $\mathbf{s}'=E_s(\mathbf{x}')$ is the style extracted from the other image $\mathbf{x}'$, and ${\hat{\mathbf{s}}}'$ is obtained similarly to ${\hat{\mathbf{s}}}$ with $\mathbf{x}$ and $\mathbf{x}'$ swapped. Light blue nodes denote data elements; light green, loss functions; others, functions (subnetworks). Functions with identical labels have shared weights. Adversarial losses are omitted for clarity.
  • Figure 4: Enhancement scheme: the input is split into subimages (color-coded) that are translated individually by HiDT at medium resolution. The outputs are then merged using the merging network $G_{\mathrm{enh}}$. For illustration purposes, we show upsampling by a factor of two, but in the experiments we use a factor of four. We also apply bilinear downsampling (with shifts -- see text for detail) rather than strided subsampling when decomposing the input into medium resolution images.
  • Figure 5: Training without segmentation losses is prone to failures of semantic consistency. Left: original images. Right: transferred images. (a) Our ablated model, trained without the auxiliary segmentation task, turns grass into water; (b) FUNIT hallucinates grass on the building.
  • ...and 5 more figures
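
The decomposition described in the Figure 4 caption can be sketched in simplified form. The example below uses strided subsampling to split an image into factor × factor lower-resolution subimages and then interleaves them back. This matches the caption's illustration, not the actual pipeline: per the caption, the experiments use bilinear downsampling with shifts, and the merging is done by the learned network $G_{\mathrm{enh}}$, for which plain interleaving is only a stand-in. The function names `split_into_subimages` and `interleave_subimages` are hypothetical.

```python
def split_into_subimages(image, factor):
    """Split a 2D list (H x W) into factor*factor strided subimages.

    Assumes H and W are divisible by factor. Subimage (dy, dx) keeps
    the pixels at rows dy, dy+factor, ... and columns dx, dx+factor, ...
    """
    return [[row[dx::factor] for row in image[dy::factor]]
            for dy in range(factor) for dx in range(factor)]


def interleave_subimages(subimages, factor):
    """Inverse of split_into_subimages: put every pixel back in place."""
    sub_h, sub_w = len(subimages[0]), len(subimages[0][0])
    merged = [[None] * (sub_w * factor) for _ in range(sub_h * factor)]
    for index, sub in enumerate(subimages):
        dy, dx = divmod(index, factor)
        for y, row in enumerate(sub):
            for x, value in enumerate(row):
                merged[y * factor + dy][x * factor + dx] = value
    return merged


# A 4 x 4 "image" whose pixel values are their own flat indices.
image = [[y * 4 + x for x in range(4)] for y in range(4)]
parts = split_into_subimages(image, 2)     # four 2 x 2 subimages
restored = interleave_subimages(parts, 2)  # identical to `image`
```

In the scheme from the caption, each subimage would be translated individually at medium resolution before merging. Because independently translated subimages need not agree pixel by pixel, a learned merging step is a natural replacement for the exact interleaving shown here.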