ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello

TL;DR

ENet targets real-time semantic segmentation on mobile and embedded platforms with a lightweight encoder-decoder built from bottleneck modules, asymmetric and dilated convolutions, and information-preserving downsampling. Compared with existing models it is up to 18x faster, needs 75x fewer FLOPs, and has 79x fewer parameters, while maintaining competitive accuracy on Cityscapes, CamVid, and SUN RGB-D, and it runs in real time on embedded hardware such as the NVIDIA TX1. The work emphasizes end-to-end efficiency without post-processing, analyzes hardware and software bottlenecks, and outlines practical ways (kernel fusion, cuDNN improvements) to increase speed further. ENet is therefore a practical option for on-device scene understanding in applications such as mobile robotics and AR/VR.

Abstract

The ability to perform pixel-wise semantic segmentation in real time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task require a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low-latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ fewer FLOPs, has 79$\times$ fewer parameters, and provides accuracy similar to or better than that of existing models. We have tested it on the CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and on the trade-offs between the accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

Paper Structure

This paper contains 22 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: ENet predictions on different datasets (left to right Cityscapes, CamVid, and SUN).
  • Figure 2: (a) ENet initial block. MaxPooling is performed with non-overlapping $2 \times 2$ windows, and the convolution has 13 filters, which gives 16 feature maps after concatenation (13 from the convolution plus the 3 pooled channels of an RGB input). This is heavily inspired by [szegedy2015rethinking]. (b) ENet bottleneck module. conv is either a regular, dilated, or full convolution (also known as deconvolution) with $3 \times 3$ filters, or a $5 \times 5$ convolution decomposed into two asymmetric ones.
  • Figure 3: PReLU weight distribution vs. network depth. The blue line is the mean of the weights, while the area between the maximum and minimum weight is grayed out. Each vertical dotted line corresponds to a PReLU in the main branch and marks the boundary between two bottleneck blocks. The gray vertical line at the 67th module marks the encoder-decoder border.
  • Figure 4: ENet predictions on the Cityscapes validation set [cityscape2016].
  • Figure 5: ENet predictions on the CamVid test set [camvid08].
  • ...and 1 more figure
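
The initial block described in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The caption fixes only the 2x2 non-overlapping max pooling, the 13 convolution filters, and the concatenation into 16 feature maps; the 3x3 kernel size, stride 2, zero padding of 1, the RGB input, the concatenation order, and all function names below are assumptions made for illustration. The last helper compares the weight count of a full 5x5 kernel with its decomposition into two asymmetric kernels, taken here to be 5x1 followed by 1x5.

```python
import random


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a [channels][height][width] nested list."""
    return [
        [
            [max(ch[i][j], ch[i][j + 1], ch[i + 1][j], ch[i + 1][j + 1])
             for j in range(0, len(ch[0]) - 1, 2)]
            for i in range(0, len(ch) - 1, 2)
        ]
        for ch in x
    ]


def conv3x3_stride2(x, weights):
    """3x3 convolution with stride 2 and zero padding 1 (assumed hyperparameters).

    x is [c_in][h][w]; weights is [c_out][c_in][3][3].
    For even h and w the result is [c_out][h // 2][w // 2].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])

    def pixel(c, i, j):
        # Positions outside the image read as zero (zero padding).
        return x[c][i][j] if 0 <= i < h and 0 <= j < w else 0.0

    out = []
    for kernel in weights:
        fmap = []
        for i in range(0, h, 2):
            row = []
            for j in range(0, w, 2):
                acc = 0.0
                for c in range(c_in):
                    for di in range(3):
                        for dj in range(3):
                            acc += kernel[c][di][dj] * pixel(c, i + di - 1, j + dj - 1)
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out


def initial_block(image, weights):
    """Run both branches on the same input and concatenate along the channel axis."""
    convolved = conv3x3_stride2(image, weights)  # 13 feature maps
    pooled = max_pool_2x2(image)                 # 3 feature maps for an RGB input
    return convolved + pooled                    # list concatenation = channel concat


def kernel_weights(height, width):
    """Weights in one kernel slice, per (input channel, output channel) pair."""
    return height * width


random.seed(0)
# Hypothetical 8x8 RGB input and 13 random filters, just to check the shapes.
image = [[[random.random() for _ in range(8)] for _ in range(8)] for _ in range(3)]
weights = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(13)]

features = initial_block(image, weights)
print(len(features), len(features[0]), len(features[0][0]))  # channels, height, width

# Full 5x5 kernel versus an asymmetric 5x1 + 1x5 pair.
print(kernel_weights(5, 5), kernel_weights(5, 1) + kernel_weights(1, 5))
```

Both branches halve the spatial resolution, so their outputs can be concatenated directly: 13 convolutional maps plus 3 pooled maps give the 16 feature maps mentioned in the caption. Under the assumed decomposition, the asymmetric pair uses 10 weights per channel pair instead of 25.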