Short-Time Fourier Transform for deblurring Variational Autoencoders

Vibhu Dalal

TL;DR

Variational Autoencoders (VAEs) often produce blurry samples, and much prior work addresses this by modifying the reconstruction term of the evidence lower bound (ELBO). The paper proposes a reconstruction objective that uses the short-time Fourier transform (STFT) to enforce local spectral coherence between input and output, emphasizing high-frequency details through a phase-weighted loss. The local frequency loss $L_{freq}$ is computed from STFT-derived amplitudes and phases using a Hann window, and is combined with an SSIM term and the KL regularizer in the overall objective $L = \beta D_{KL}[q_\phi(z|x) \,\|\, p(z)] + \lambda_{freq} L_{freq} + \mathrm{SSIM}(S_i, S_o)$. Experiments on MNIST show improvements in PSNR and competitive SSIM relative to baselines, suggesting that local spectral cues can reduce VAE blur and may be applicable beyond MNIST.
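To make the idea of a local, phase-aware frequency loss concrete, here is a minimal NumPy sketch. It is not the paper's formulation: the function names `stft2d` and `local_freq_loss`, the window size of 8, the hop of 4, and the `1 - cos` phase weighting are all assumptions chosen for illustration. It only shows the general pattern of windowing patches with a Hann window, taking a 2D FFT of each patch, and comparing amplitudes and phases patch by patch.

```python
import numpy as np

def stft2d(img, win=8, hop=4):
    """2D STFT sketch: FFT of Hann-windowed patches slid over a 2D image.

    Returns a complex array of shape (n_rows, n_cols, win, win).
    """
    w1 = np.hanning(win)
    window = np.outer(w1, w1)  # separable 2D Hann window
    H, W = img.shape
    rows = []
    for i in range(0, H - win + 1, hop):
        row = []
        for j in range(0, W - win + 1, hop):
            patch = img[i:i + win, j:j + win] * window
            row.append(np.fft.fft2(patch))
        rows.append(row)
    return np.array(rows)

def local_freq_loss(x, x_hat, win=8, hop=4):
    """Illustrative local frequency loss (assumed form, not the paper's).

    Sums a squared amplitude error and an amplitude-weighted phase
    mismatch term, averaged over all patches and frequencies.
    """
    X = stft2d(x, win, hop)
    Y = stft2d(x_hat, win, hop)
    amp_err = (np.abs(X) - np.abs(Y)) ** 2
    # Phase difference wrapped to [-pi, pi] via the angle of X * conj(Y).
    dphi = np.angle(X * np.conj(Y))
    # 0 when local phases agree, 2 when they are opposite.
    phase_mismatch = 1.0 - np.cos(dphi)
    return float(np.mean(amp_err + np.abs(X) * np.abs(Y) * phase_mismatch))
```

For a 28x28 input such as an MNIST digit, these defaults give a 6x6 grid of overlapping patches. In a training setup the same computation would need to be written in a differentiable framework and scaled by $\lambda_{freq}$ before being added to the other terms.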

Abstract

Variational Autoencoders (VAEs) are powerful generative models; however, their generated samples are known to suffer from a characteristic blurriness compared to the outputs of alternative generative techniques. Extensive research efforts have been made to tackle this problem, and several works have focused on modifying the reconstruction term of the evidence lower bound (ELBO). In particular, many have experimented with augmenting the reconstruction loss with losses in the frequency domain. Such loss functions usually employ the Fourier transform to explicitly penalise the lack of higher-frequency components, which are responsible for sharp visual features, in the generated samples. In this paper, we explore the aspects of such previous approaches that aren't well understood, and we propose an augmentation to the reconstruction term in response to them. Our reasoning leads us to use the short-time Fourier transform and to emphasise local phase coherence between the input and output samples. We illustrate the potential of our proposed loss on the MNIST dataset by providing both qualitative and quantitative results.

Paper Structure

This paper contains 7 sections, 10 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Illustration of the significance of local phase in blur perception. (a) original image; (b) blurred image obtained by convolving (a) with a circularly symmetric Gaussian low-pass filter; (c) image (a) with its high-frequency band energy reduced to match that of (b); (d) image (b) with its high-frequency band energy raised to match that of (a).
  • Figure 2: Samples generated after training the VAE on the MNIST dataset. In (a) we see the characteristic blurriness of VAEs when MSE is used as the reconstruction loss term. In (b) we already see an improvement over (a), but the digits still possess a degree of roughness and blur. In (c) we notice that both blurriness and roughness have reduced slightly.
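The constructions described in the Figure 1 caption can be approximated with a short NumPy sketch. This is an illustration under assumptions, not the procedure used to produce the figure: the helper names, the Gaussian width `sigma=2.0`, and the radial cutoff of 0.25 cycles per pixel that defines the "high-frequency band" are all hypothetical choices. The point it shows is that the energy in a band can be rescaled while every Fourier phase is left unchanged.

```python
import numpy as np

def _radial_freq(shape):
    """Radial spatial frequency (cycles per pixel) for each FFT bin."""
    H, W = shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)

def gaussian_lowpass(img, sigma=2.0):
    """Blur with a circularly symmetric Gaussian, applied in the Fourier domain."""
    r = _radial_freq(img.shape)
    # Fourier transform of a unit-area spatial Gaussian with std sigma (pixels).
    G = np.exp(-2.0 * (np.pi * sigma) ** 2 * r ** 2)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * G))

def high_band_energy(img, cutoff=0.25):
    """Sum of squared Fourier amplitudes above the radial cutoff."""
    F = np.fft.fft2(img)
    mask = _radial_freq(img.shape) > cutoff
    return float(np.sum(np.abs(F[mask]) ** 2))

def match_high_band_energy(src, ref, cutoff=0.25):
    """Rescale src's high-band amplitudes so the band energy equals ref's.

    A single real, positive gain is applied to the band, so the phases
    of src are not altered.
    """
    F = np.fft.fft2(src)
    mask = _radial_freq(src.shape) > cutoff
    gain = np.sqrt(high_band_energy(ref, cutoff) / np.sum(np.abs(F[mask]) ** 2))
    F[mask] *= gain
    return np.real(np.fft.ifft2(F))
```

Under these assumptions, `match_high_band_energy(a, gaussian_lowpass(a))` plays the role of panel (c), and `match_high_band_energy(gaussian_lowpass(a), a)` plays the role of panel (d).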