Inverting Visual Representations with Convolutional Networks

Alexey Dosovitskiy, Thomas Brox

TL;DR

The paper introduces an up-convolutional network framework to invert image representations, enabling reconstruction from both traditional descriptors (HOG, SIFT, LBP) and deep CNN features (AlexNet). By learning the conditional expectation of images given feature vectors, the approach reveals what information is preserved and what is discarded by different representations, showing that colors and rough object layout can be recovered even from high-level activations and class probabilities. It also demonstrates that high-level reconstructions rely largely on activation patterns rather than precise magnitudes, and that the model learns a natural image prior that supports plausible colorization and structure from random features. The method is fast at test time and broadly applicable to arbitrary feature representations, offering new insights into the structure of visual representations and the role of invariances in CNNs.
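The "conditional expectation" claim follows from training with a squared-error loss. The derivation below is a sketch in assumed notation that is not taken from the summary above: $x$ is an image, $\Phi$ the feature extractor, $\phi = \Phi(x)$ the feature vector, and $f(\cdot\,; w)$ the inversion network with weights $w$.

```latex
% Assumed notation: x = image, \Phi = feature extractor, \phi = \Phi(x),
% f(\cdot; w) = inversion network with weights w.
% Training objective over pairs (x_i, \phi_i):
\hat{w} = \arg\min_{w} \sum_{i} \bigl\| x_i - f(\phi_i; w) \bigr\|_2^2

% For any reconstruction function f, the expected squared error decomposes as
\mathbb{E}\,\bigl\| x - f(\phi) \bigr\|_2^2
  = \mathbb{E}\,\bigl\| x - \mathbb{E}[x \mid \phi] \bigr\|_2^2
  + \mathbb{E}\,\bigl\| \mathbb{E}[x \mid \phi] - f(\phi) \bigr\|_2^2

% The first term does not depend on f, so the minimizer is
f^{*}(\phi) = \mathbb{E}\bigl[\, x \mid \Phi(x) = \phi \,\bigr]
```

Under this reading, a reconstruction is the average of all images that map to the same feature vector. Details the representation discards are therefore averaged out, which would show up as blur, while details it preserves stay sharp.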

Abstract

Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features. Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network. Most strikingly, the colors and the rough contours of an image can be reconstructed from activations in higher network layers and even from the predicted class probabilities.
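The abstract does not spell out what an up-convolutional layer computes. The sketch below assumes one common construction: upsample by inserting zeros, then convolve. It is a 1-D toy in plain Python, not the authors' implementation. The function names and the fixed kernel are hypothetical, and in a trained network the kernel weights would be learned.

```python
def upsample_zero_insert(signal, factor=2):
    """Place each value at the start of a block of `factor` entries; fill the rest with zeros."""
    out = []
    for value in signal:
        out.append(value)
        out.extend([0.0] * (factor - 1))
    return out


def convolve_same(signal, kernel):
    """Zero-padded sliding-window filter with an odd-length kernel; output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def up_convolve(signal, kernel, factor=2):
    """Assumed up-convolution: zero-insertion upsampling followed by convolution."""
    return convolve_same(upsample_zero_insert(signal, factor), kernel)


if __name__ == "__main__":
    # Illustrative fixed kernel; it fills each inserted zero with the average of its neighbours.
    print(up_convolve([1.0, 2.0], [0.5, 1.0, 0.5]))
```

Stacking several such layers turns a short feature vector into a progressively larger output. In two dimensions the same idea grows a low-resolution feature map into a full-size image.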

Paper Structure

This paper contains 17 sections, 3 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: We train convolutional networks to reconstruct images from different feature representations. Top row: Input features. Bottom row: Reconstructed image. Reconstructions from HOG and SIFT are very realistic. Reconstructions from AlexNet preserve color and rough object positions even when reconstructing from higher layers.
  • Figure 2: Reconstructing an image from its HOG descriptors with different methods.
  • Figure 3: Inversion of shallow image representations. Note how in the first row the color of grass and trees is predicted correctly in all cases, although it is not contained in the features.
  • Figure 4: Reconstructing an image from SIFT descriptors with different methods. (a) an image, (b) SIFT keypoints, (c) reconstruction by Weinzaepfel et al. (CVPR 2011), (d) our reconstruction.
  • Figure 5: Reconstructions from different layers of AlexNet.
  • ...and 16 more figures