Multi-Scale Dense Networks for Resource Efficient Image Classification

Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, Kilian Q. Weinberger

TL;DR

The paper tackles inference-time resource constraints by introducing Multi-Scale DenseNet (MSDNet), a CNN with dense inter-layer connections and a two-dimensional, multi-scale feature hierarchy that supports multiple early exits. By maintaining coarse and fine features throughout and densely connecting layers, MSDNet enables anytime prediction and budgeted batch classification with shared computation and minimal interference between exits. Empirical results on CIFAR-10/100 and ImageNet show MSDNet outperforms strong baselines across a spectrum of computational budgets, often by large margins, and a DenseNet variant highlights efficiency gains. The work demonstrates a practical path to accurate, resource-aware image classification suitable for diverse devices and large-scale systems.

Abstract

In this paper we investigate image classification with computational resource limits at test time. Two such settings are: (1) anytime classification, where the network's prediction for a test example is progressively updated, so that a prediction can be output at any time; and (2) budgeted batch classification, where a fixed amount of computation is available to classify a set of examples and can be spent unevenly across "easier" and "harder" inputs. In contrast to most prior work, such as the popular Viola and Jones algorithm, our approach is based on convolutional neural networks. We train multiple classifiers with varying resource demands, which we adaptively apply at test time. To maximally reuse computation between the classifiers, we incorporate them as early exits into a single deep convolutional neural network and interconnect them with dense connectivity. To facilitate high-quality classification early on, we use a two-dimensional multi-scale network architecture that maintains coarse- and fine-level features throughout the network. Experiments on three image-classification tasks demonstrate that our framework substantially improves on the existing state of the art in both settings.
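
The two test-time settings in the abstract can be illustrated with a small sketch. It assumes the network has already produced one class-probability vector per early exit, and that the cumulative cost of reaching each exit is known. The function names, the `thresholds` parameter, and the rule "stop at the first exit whose top probability clears a threshold" are illustrative assumptions; the abstract says only that the classifiers are applied adaptively.

```python
from typing import List, Optional, Sequence, Tuple


def _argmax(probs: Sequence[float]) -> int:
    """Index of the largest entry in a probability vector."""
    return max(range(len(probs)), key=probs.__getitem__)


def anytime_predict(
    exit_probs: Sequence[Sequence[float]],
    exit_costs: Sequence[float],
    budget: float,
) -> Optional[Tuple[int, int]]:
    """Anytime setting: return (exit_index, label) of the deepest exit whose
    cumulative cost fits in `budget`, or None if even the first exit is too costly.

    `exit_costs` is assumed to be cumulative and non-decreasing.
    """
    best = None
    for k, (probs, cost) in enumerate(zip(exit_probs, exit_costs)):
        if cost > budget:
            break  # later exits are at least as expensive
        best = (k, _argmax(probs))
    return best


def early_exit_predict(
    exit_probs: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> Tuple[int, int]:
    """Budgeted-batch setting (hypothetical exit rule): stop at the first exit
    whose top class probability reaches that exit's threshold. The last exit
    always answers, so "hard" inputs use the whole network.
    """
    last = len(exit_probs) - 1
    for k, probs in enumerate(exit_probs):
        label = _argmax(probs)
        if k == last or probs[label] >= thresholds[k]:
            return k, label
    raise ValueError("exit_probs must be non-empty")


# Made-up outputs of three exits for one image (three classes).
EXIT_PROBS: List[List[float]] = [
    [0.5, 0.3, 0.2],
    [0.2, 0.7, 0.1],
    [0.05, 0.9, 0.05],
]
EXIT_COSTS = [1.0, 2.5, 4.0]  # hypothetical cumulative costs, arbitrary units

if __name__ == "__main__":
    print(anytime_predict(EXIT_PROBS, EXIT_COSTS, budget=3.0))
    print(early_exit_predict(EXIT_PROBS, thresholds=[0.6, 0.6, 0.0]))
```

With lower thresholds, more inputs leave at cheap exits. In a budgeted batch, this is what lets computation be spent unevenly across "easier" and "harder" examples.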

Paper Structure

This paper contains 18 sections, 11 figures.

Figures (11)

  • Figure 1: Two images containing a horse. The left image is canonical and easy to detect even with a small model, whereas the right image requires a computationally more expensive network architecture. (Copyright Pixel Addict and Doyle (CC BY-ND 2.0).)
  • Figure 2: Illustration of the first four layers of an MSDNet with three scales. The horizontal direction corresponds to the layer direction (depth) of the network. The vertical direction corresponds to the scale of the feature maps. Horizontal arrows indicate a regular convolution operation, whereas diagonal and vertical arrows indicate a strided convolution operation. Classifiers only operate on feature maps at the coarsest scale. Connections across more than one layer are not drawn explicitly: they are implicit through recursive concatenations.
  • Figure 3: Relative accuracy of the intermediate classifier (left) and the final classifier (right) when introducing a single intermediate classifier at different layers in a ResNet, DenseNet and MSDNet. All experiments were performed on the CIFAR-100 dataset. Higher is better.
  • Figure 4: The output $\mathbf{x}_{\ell}^{s}$ of layer $\ell$ at the $s^\text{th}$ scale in an MSDNet. Herein, $[ \dots ]$ denotes the concatenation operator, $h_{\ell}^s(\cdot)$ a regular convolution transformation, and $\tilde{h}_{\ell}^s(\cdot)$ a strided convolution. Note that the outputs of $h_{\ell}^s$ and $\tilde{h}_{\ell}^s$ have the same feature map size; their outputs are concatenated along the channel dimension.
  • Figure 5: Accuracy (top-1) of anytime prediction models as a function of computational budget on the ImageNet (left) and CIFAR-100 (right) datasets. Higher is better.
  • ...and 6 more figures
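
The recursion described in the Figure 4 caption can be written out as a sketch, using the caption's own symbols. Two details are assumptions rather than statements from the captions: that the strided branch $\tilde{h}_{\ell}^s$ reads from the next-finer scale $s-1$ (suggested by the diagonal arrows of Figure 2), and that the form below applies to layers $\ell \ge 2$. The first layer, which has no earlier outputs to concatenate, is left out.

```latex
\[
\mathbf{x}_{\ell}^{s} =
\begin{cases}
h_{\ell}^{1}\!\left(\left[\mathbf{x}_{1}^{1}, \dots, \mathbf{x}_{\ell-1}^{1}\right]\right),
  & s = 1, \\[6pt]
\left[\,
  \tilde{h}_{\ell}^{s}\!\left(\left[\mathbf{x}_{1}^{s-1}, \dots, \mathbf{x}_{\ell-1}^{s-1}\right]\right),\;
  h_{\ell}^{s}\!\left(\left[\mathbf{x}_{1}^{s}, \dots, \mathbf{x}_{\ell-1}^{s}\right]\right)
\,\right],
  & s > 1,
\end{cases}
\qquad \ell \ge 2 .
\]
```

The inner brackets concatenate all earlier layers' outputs at a single scale, which is the "recursive concatenation" that makes the long-range connections of Figure 2 implicit. The outer bracket joins the two branch outputs along the channel dimension.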