Understanding image representations by measuring their equivariance and equivalence

Karel Lenc, Andrea Vedaldi

TL;DR

The paper formalizes image representations through the notions of equivariance, invariance, and equivalence, and develops empirical methods to measure and exploit these properties. It introduces transformation and stitching layers to learn how representations respond to image transforms and how different representations can be stitched together, respectively. Across shallow descriptors like HOG and deep CNN layers, it shows that early layers are largely equivariant and interchangeable, while deeper layers become more task-specific, yet useful equivalence relations persist. The proposed framework enables practical benefits, notably accelerating structured-output regression by exploiting learned equivariant mappings.
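Stated compactly, and only as a sketch of the three notions: the symbols $\phi$, $g$, and $M_g$ follow the paper's figure captions, while the name $E_{\phi\to\phi'}$ for the map between two representations is an assumption introduced here for illustration.

$$
\begin{aligned}
\text{equivariance:}\quad & \phi(g\mathbf{x}) = M_g\,\phi(\mathbf{x}) \quad \text{for all images } \mathbf{x},\\
\text{invariance:}\quad & \phi(g\mathbf{x}) = \phi(\mathbf{x}) \quad \text{(the special case where } M_g \text{ is the identity)},\\
\text{equivalence:}\quad & \phi'(\mathbf{x}) = E_{\phi\to\phi'}\,\phi(\mathbf{x}) \quad \text{for all images } \mathbf{x}.
\end{aligned}
$$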

Abstract

Despite the importance of image representations such as histograms of oriented gradients and deep Convolutional Neural Networks (CNN), our theoretical understanding of them remains limited. Aiming at filling this gap, we investigate three key mathematical properties of representations: equivariance, invariance, and equivalence. Equivariance studies how transformations of the input image are encoded by the representation, invariance being a special case where a transformation has no effect. Equivalence studies whether two representations, for example two different parametrisations of a CNN, capture the same visual information or not. A number of methods to establish these properties empirically are proposed, including introducing transformation and stitching layers in CNNs. These methods are then applied to popular representations to reveal insightful aspects of their structure, including clarifying at which layers in a CNN certain geometric invariances are achieved. While the focus of the paper is theoretical, direct applications to structured-output regression are demonstrated too.
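To make the idea of establishing equivariance empirically concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It uses a toy representation (a hypothetical random filter bank followed by a ReLU), takes $g$ to be a horizontal flip, and fits a linear map $M_g$ by ridge regression so that $\phi(g\mathbf{x}) \approx M_g\phi(\mathbf{x})$. The filter bank, the sizes, the regulariser `lam`, and all function names are illustrative assumptions; ridge regression is used as a generic stand-in for a learning strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
side, k = 6, 8  # hypothetical toy sizes: 6x6 images, 2*k filters


def hflip(x):
    """Transformation g: flip flattened side x side images left-to-right."""
    return x.reshape(-1, side, side)[:, :, ::-1].reshape(-1, side * side)


# Toy filter bank closed under flipping: k random filters plus their mirror
# images. For this bank the exact M_g is a permutation of the features.
W_half = rng.normal(size=(k, side * side))
W = np.vstack([W_half, hflip(W_half)])


def phi(x):
    """Toy representation (an assumption): linear filters followed by ReLU."""
    return np.maximum(x @ W.T, 0.0)


def fit_equivariant_map(F, Fg, lam=1e-3):
    """Ridge regression: minimise ||F M^T - Fg||^2 + lam * ||M||^2 over M."""
    d = F.shape[1]
    # Normal equations: (F^T F + lam I) M^T = F^T Fg
    Mt = np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ Fg)
    return Mt.T


def rel_error(pred, target):
    """Relative Frobenius-norm error between predicted and target features."""
    return np.linalg.norm(pred - target) / np.linalg.norm(target)


x_train = rng.normal(size=(2000, side * side))
x_test = rng.normal(size=(500, side * side))

# Learn M_g from pairs (phi(x), phi(g x)) and evaluate on held-out images.
M = fit_equivariant_map(phi(x_train), phi(hflip(x_train)))
P = np.roll(np.eye(2 * k), k, axis=0)  # ground-truth feature permutation

err_learned = rel_error(phi(x_test) @ M.T, phi(hflip(x_test)))
err_identity = rel_error(phi(x_test), phi(hflip(x_test)))  # tests invariance

print(f"error with learned M_g: {err_learned:.2e}")
print(f"error assuming invariance (M_g = I): {err_identity:.2e}")
```

In this toy setting a low error for the learned map together with a high error for the identity map indicates that the representation is equivariant, but not invariant, to the flip. Real representations such as HOG or CNN layers would generally be only approximately equivariant, so the residual error itself becomes the quantity of interest.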

Paper Structure

This paper contains 26 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Equivariant transformation of CNN filters. Top: Conv1 and Conv2 filters of a convolutional neural network, visualised with the method of Simonyan et al. (2013). Other rows: geometrically warped filters reconstructed from an equivariant transformation of the network output, learned using the method of Sect. \ref{s:learning}, for horizontal flip, vertical flip, and $90^\circ$ rotation.
  • Figure 2: Structured sparsity. Predicting equivariant features at location $(u,v)$ uses a corresponding small neighbourhood of features $\Omega_{g,m}(u,v)$.
  • Figure 3: Regression methods. HOG feature reconstruction error (average per-cell Hellinger distance) achieved by the learned equivariant mapping $M_g$ when $g$ is set to different image rotations (\ref{fig:hog_mg_rot}) and scalings (\ref{fig:hog_mg_sc}), for different learning strategies (see text). No other constraint is imposed on $A_g$. In the right panel (\ref{fig:hog_mg_cell}) the experiment is repeated for the $45^\circ$ rotation, this time imposing structured sparsity on $A_g$ for different values of the neighbourhood size $m$.
  • Figure 4: Equivariant classification using HOG features. Classification performance of a HOG-based classifier trained to discriminate dog and cat heads as the test images are gradually rotated and scaled and the effect compensated by equivariant maps learned using LS, RR, and FS.
  • Figure 5: Qualitative evaluation of equivariant HOG. Visualisation of the features $\phi(\mathbf{x})$, $\phi(g\mathbf{x})$ and $M_g\phi(\mathbf{x})$ using the HOGgles HOG inverse $\phi^{-1}$ (Vondrick et al., 2013). $M_g$ is learned using FS with $k=5$ and $m=3$, and $g$ is set to a rotation by $45^\circ$ and to up/down-scaling by $\sqrt{2}$, respectively. The dashed boxes show the support of the reconstructed features.
  • ...and 4 more figures