A Systematic Performance Analysis of Deep Perceptual Loss Networks: Breaking Transfer Learning Conventions

Gustav Grund Pihlgren, Konstantina Nikolaidou, Prakash Chandra Chhipa, Nosheen Abid, Rajkumar Saini, Fredrik Sandin, Marcus Liwicki

TL;DR

The paper addresses how to choose loss networks for deep perceptual loss by systematically evaluating 14 ImageNet-pretrained architectures, each at four feature extraction layers, on four benchmarks (perceptual similarity, super-resolution, segmentation, and autoencoding). The authors find that the choice of extraction layer is at least as important as the choice of architecture, that VGG networks without batch normalization often perform best, and that ImageNet accuracy shows no simple correlation with downstream performance. They also show that two common transfer-learning conventions do not hold for deep perceptual loss: higher ImageNet accuracy does not guarantee better loss performance, and later-layer features are not always superior. These findings yield practical guidance for selecting loss networks and highlight the need to reevaluate transfer-learning assumptions in perceptual-loss applications, with implications for efficiency and downstream task quality.

Abstract

In recent years, deep perceptual loss has been widely and successfully used to train machine learning models for many computer vision tasks, including image synthesis, segmentation, and autoencoding. Deep perceptual loss is a type of loss function for images that computes the error between two images as the distance between deep features extracted from a neural network. Most applications of the loss use pretrained networks called loss networks for deep feature extraction. However, despite increasingly widespread use, the effects of loss network implementation on the trained models have not been studied. This work rectifies this through a systematic evaluation of the effect of different pretrained loss networks on four different application areas. Specifically, the work evaluates 14 different pretrained architectures with four different feature extraction layers. The evaluation reveals that VGG networks without batch normalization have the best performance and that the choice of feature extraction layer is at least as important as the choice of architecture. The analysis also reveals that deep perceptual loss does not adhere to the transfer learning conventions that better ImageNet accuracy implies better downstream performance and that feature extraction from the later layers provides better performance.
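To make the definition above concrete, the following is a minimal sketch of a deep perceptual loss, not the authors' implementation. It assumes a toy stand-in for the loss network: a stack of fixed, randomly initialized dense layers with ReLU activations acting on flattened images. In practice the loss network would be an ImageNet-pretrained architecture such as VGG. The names `make_toy_loss_network`, `extract_features`, and `deep_perceptual_loss` are hypothetical, and the mean squared distance is only one possible choice of feature distance.

```python
import random


def make_toy_loss_network(input_dim, layer_dims, seed=0):
    """Build fixed weights for a toy loss network (ASSUMPTION: stands in
    for a pretrained network; the weights here are random, not learned)."""
    rng = random.Random(seed)
    weights = []
    prev_dim = input_dim
    for dim in layer_dims:
        scale = 1.0 / prev_dim ** 0.5
        weights.append(
            [[rng.gauss(0.0, scale) for _ in range(prev_dim)] for _ in range(dim)]
        )
        prev_dim = dim
    return weights


def extract_features(weights, image, layer):
    """Run the image through layers 0..layer and return that layer's features."""
    if not 0 <= layer < len(weights):
        raise ValueError("layer index out of range")
    x = list(image)
    for w in weights[: layer + 1]:
        # Dense layer followed by ReLU.
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return x


def deep_perceptual_loss(weights, image_a, image_b, layer):
    """Mean squared distance between the deep features of two images."""
    feats_a = extract_features(weights, image_a, layer)
    feats_b = extract_features(weights, image_b, layer)
    return sum((a - b) ** 2 for a, b in zip(feats_a, feats_b)) / len(feats_a)


if __name__ == "__main__":
    network = make_toy_loss_network(input_dim=4, layer_dims=[8, 6, 3])
    target = [0.1, 0.5, 0.9, 0.3]      # flattened "ground truth" image
    prediction = [0.2, 0.4, 0.8, 0.1]  # flattened model output
    # The same image pair gives a different loss at each extraction layer.
    for layer in range(3):
        print(layer, deep_perceptual_loss(network, target, prediction, layer))
```

The `layer` argument corresponds to the feature extraction layer, which the paper reports to be at least as important as the choice of architecture. The sketch only shows where that choice enters the computation.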

Paper Structure

This paper contains 35 sections, 6 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: The procedure followed in this work. This work investigates the effect of loss networks with different ImageNet [deng2009imagenet] pretrained architectures and feature extraction layers on the downstream performance of deep perceptual loss and similarity applications. Loss networks with 14 pretrained architectures are examined for four different feature extraction layers by evaluating them on four application areas of deep perceptual loss and similarity. For each application area, a benchmark derived from a prior work is used for evaluation [zhang2018unreasonable, johnson2016perceptual, mosinska2018beyond, pihlgren2021pretraining]. The attributes and performance of each loss network are analyzed and cross-referenced with the other loss networks to uncover which attributes are correlated with performance and other trends. This work makes novel contributions (blue, round-corner) regarding feature extraction layers, systematic analysis of loss networks, and systematic evaluation on Benchmarks 2 through 4. The major contribution of this work (green, cut-corner) is the large analysis and cross-reference of the attributes and performance scores.
  • Figure 2: The results of each loss network ordered by extraction layer (earliest to latest) for some performance scores for each benchmark.
  • Figure 3: The performance of all loss networks on Benchmark 4 as measured by the downstream accuracy on SVHN and STL-10 where the loss networks have been grouped by extraction layer. More figures like these can be quickly generated in the supplementary spreadsheet, using any combination of investigated attributes and performance scores.
  • Figure 4: The performance of all loss networks on the 2AFC split of BAPPS compared to the $\log_{10}$ of the number of FLOPs in a forward pass of that loss network.
  • Figure 5: The MRD Quality performance of each loss network compared to the ImageNet top-1 accuracy of the pretrained models.