ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Adam Paszke, Abhishek Chaurasia, Sangpil Kim, Eugenio Culurciello
TL;DR
ENet targets real-time semantic segmentation on mobile and embedded platforms with a compact encoder-decoder built from bottleneck modules, asymmetric and dilated convolutions, and information-preserving downsampling. It runs up to 18x faster than existing models, with 75x fewer FLOPs and 79x fewer parameters, while maintaining competitive accuracy on Cityscapes, CamVid, and SUN RGB-D, and it demonstrates real-time performance on embedded hardware such as the NVIDIA TX1. The work emphasizes end-to-end efficiency without post-processing, analyzes hardware and software bottlenecks, and outlines practical avenues (kernel fusion, cuDNN improvements) for further speedups. Overall, ENet offers a practical option for on-device scene understanding, with relevance to mobile robotics and AR/VR applications.
Abstract
The ability to perform pixel-wise semantic segmentation in real time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating-point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low-latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ fewer FLOPs, has 79$\times$ fewer parameters, and provides accuracy similar to or better than that of existing models. We have tested it on the CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and on the trade-offs between the accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
