UberNet: Training a 'Universal' Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory

Iasonas Kokkinos

TL;DR

The paper addresses the problem of training a single convolutional neural network to perform a broad portfolio of vision tasks spanning low-, mid-, and high-level vision. It introduces UberNet, a VGG-based trunk with skip-layer fusion and multi-task heads that can be trained end-to-end on diverse, incompletely annotated datasets using an asynchronous, memory-efficient training scheme. Empirical results demonstrate competitive performance across boundary detection, surface normals, saliency, semantic segmentation, semantic boundaries, human parts, and object detection, with runtimes around 0.6–0.7 seconds per frame on a single GPU. The work shows that memory complexity can be made largely independent of the number of tasks, enabling scalable multi-task learning, and outlines directions for future work including deeper architectures and structured prediction integration.

Abstract

In this work we introduce a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture that is trained end-to-end. Such a universal network can act like a 'Swiss knife' for vision tasks; we call this architecture an UberNet to indicate its overarching nature. We address two main technical challenges that emerge when broadening the range of tasks handled by a single CNN: (i) training a deep architecture while relying on diverse training sets and (ii) training many (potentially unlimited) tasks with a limited memory budget. Properly addressing these two problems allows us to train predictors for a host of tasks without compromising accuracy. Through these advances we train, in an end-to-end manner, a CNN that simultaneously addresses (a) boundary detection, (b) normal estimation, (c) saliency estimation, (d) semantic segmentation, (e) human part segmentation, (f) semantic boundary detection, and (g) region proposal generation and object detection. We obtain competitive performance while jointly addressing all of these tasks in 0.7 seconds per frame on a single GPU. A demonstration of this system can be found at http://cvn.ecp.fr/ubernet/.

Paper Structure

This paper contains 10 sections, 6 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: We introduce a CNN that can perform tasks spanning low-, mid-, and high-level vision in a unified architecture; all results are obtained in 0.6-0.7 seconds per frame.
  • Figure 2: UberNet architecture for jointly solving multiple labelling tasks: an image pyramid is formed by successive downsampling operations, and each image is processed by a CNN with tied weights; skip-layer pooling at different layers of the VGG network ($\mathbf{C}_i$) is combined with Batch Normalization ($\mathbf{B}_i$) to provide features that are then used to form all task-specific responses ($\mathbf{E}_i^t$); these are combined across network layers ($\mathcal{F}^t$) and resolutions ($\mathcal{S}^t$) to form task-specific decisions. Loss functions at the individual-scale and fused responses are used to train task responses in a task-specific manner. For simplicity we omit the interpolation, normalization, and object detection layers; further details are provided in the text.
  • Figure 3: Vanilla backpropagation for a single task; memory lookup operations are indicated by black arrows, storage operations are indicated by orange and blue arrows for the forward and backward pass respectively. During the forward pass each layer stores its activation signals in the bottom boxes. During the backward pass these activation signals are combined with the gradient signals (top boxes) that are computed recursively, starting from the loss layer.
  • Figure 4: Low-memory backpropagation for a single task (same color code as in Fig. 3). We first store a subset of activations in memory, which then serve as 'anchor' points for running backpropagation on smaller networks. This reduces the number of layer activations/gradients that are simultaneously stored in memory.
  • Figure 5: Vanilla backpropagation for multi-task training: a naive implementation has a memory complexity of $2 N (L_{C} + T L_{T})$, where $L_{C}=6$ is the depth of the common CNN trunk, $L_{T}=3$ is the depth of the task-specific branches, and $T=2$ is the number of tasks.
  • ...and 3 more figures
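
The memory arguments in the captions of Figures 3-5 can be illustrated with a small sketch. It is not the paper's implementation, and it rests on several assumptions: $N$ is read as the storage cost of one layer's signal, the factor 2 as covering activations plus gradients, the layers are hypothetical scalar `tanh` units, and the `segment` length is an arbitrary choice. The first function evaluates the Figure 5 formula; the other two compare vanilla backpropagation with an anchor-based variant in the spirit of Figure 4, counting stored activations only.

```python
import math

def vanilla_multitask_units(n, l_c, l_t, t):
    """Figure 5 formula 2 N (L_C + T L_T); n is an assumed per-layer unit cost."""
    return 2 * n * (l_c + t * l_t)

def forward_layer(w, x):
    # Hypothetical scalar layer: y = tanh(w * x)
    return math.tanh(w * x)

def backward_layer(w, x, grad_out):
    # Chain rule for y = tanh(w * x): dy/dx = w * (1 - y^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)

def vanilla_backprop(weights, x0):
    """Store every layer input, then backpropagate (as in Figure 3)."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0  # gradient of the output with respect to itself
    for i in reversed(range(len(weights))):
        grad = backward_layer(weights[i], inputs[i], grad)
    return grad, len(inputs)

def checkpointed_backprop(weights, x0, segment):
    """Keep only 'anchor' inputs, recompute each segment during the backward pass."""
    anchors, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            anchors[i] = x  # anchor: input to the first layer of a segment
        x = forward_layer(w, x)
    grad, peak = 1.0, 0
    for start in sorted(anchors, reverse=True):
        end = min(start + segment, len(weights))
        local = [anchors[start]]  # inputs to layers start .. end-1
        for i in range(start, end - 1):
            local.append(forward_layer(weights[i], local[-1]))
        # Anchors plus the activations recomputed for this segment
        peak = max(peak, len(anchors) + len(local) - 1)
        for i in reversed(range(start, end)):
            grad = backward_layer(weights[i], local[i - start], grad)
    return grad, peak

if __name__ == "__main__":
    print(vanilla_multitask_units(n=1, l_c=6, l_t=3, t=2))  # 2 * (6 + 2 * 3) = 24
    ws = [0.5 + 0.1 * i for i in range(12)]
    print(vanilla_backprop(ws, 0.3))
    print(checkpointed_backprop(ws, 0.3, segment=4))
```

With the Figure 5 values the naive cost is $2N(6 + 2 \cdot 3) = 24N$, and it grows linearly with the number of tasks $T$. In the toy chain, the anchor-based variant returns the same gradient as the vanilla pass while holding fewer activations at once, at the price of recomputing each segment's forward pass.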