ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with Decoupled Asymmetric Convolution

Tun-Hao Yang, Tian-Sheuan Chang

TL;DR

This work tackles the challenge of achieving high-quality real-time super-resolution on edge devices by co-designing a lightweight SR model and a hardware accelerator. The ACNet architecture employs decoupled asymmetric convolutions, channel-bypass blocks, and holistic model fusion to deliver a 0.34 dB PSNR improvement over FSRCNN with a 27-layer depth and roughly 36% lower complexity, while fitting a small 17K-parameter footprint on-chip. The ACNPU hardware leverages six PE clusters, boundary SRAM, and an input-stationary, parallel execution flow to minimize external memory traffic and internal bandwidth, achieving 31.7 FPS for ×2 and 124.4 FPS for ×4 Full-HD at 270 MHz with an energy efficiency of 4.75 TOPS/W. Overall, the paper demonstrates a practical, regular hardware design that enables real-time Full-HD SR with high energy efficiency on a 40 nm process, outperforming several prior accelerators in the quality-to-cost/energy ratio.
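The frame rates quoted above can be cross-checked with a little arithmetic. The sketch below is not from the paper: it assumes the low-resolution input is the Full-HD output divided by the scale factor in each dimension, and that the 270 MHz clock applies to both modes. The helper name `pixel_rates` is hypothetical.

```python
# Back-of-the-envelope throughput check using only the figures quoted in the TL;DR.
# Assumption: input resolution = Full-HD output / scale in each dimension.
CLOCK_HZ = 270e6          # clock frequency quoted in the TL;DR
OUT_W, OUT_H = 1920, 1080  # Full-HD output frame


def pixel_rates(scale, fps):
    """Return input/output pixel rates and clock cycles available per input pixel."""
    in_w, in_h = OUT_W // scale, OUT_H // scale
    in_rate = in_w * in_h * fps
    out_rate = OUT_W * OUT_H * fps
    return {
        "input_px_per_s": in_rate,
        "output_px_per_s": out_rate,
        "cycles_per_input_px": CLOCK_HZ / in_rate,
    }


x2 = pixel_rates(2, 31.7)   # x2 mode: 960x540 input at 31.7 FPS
x4 = pixel_rates(4, 124.4)  # x4 mode: 480x270 input at 124.4 FPS

for name, r in (("x2", x2), ("x4", x4)):
    print(name, f"{r['input_px_per_s']:.3e} input px/s,",
          f"{r['cycles_per_input_px']:.2f} cycles per input px")
```

Under these assumptions both modes consume roughly 16 million input pixels per second, which would be consistent with throughput being set by the low-resolution input rather than the output size. That reading is an inference from the quoted numbers, not a statement made in the text.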

Abstract

Deep learning-driven super-resolution (SR) outperforms traditional techniques but faces the challenges of high complexity and memory bandwidth. These challenges lead many accelerators to opt for simpler, shallower models such as FSRCNN, compromising image quality to meet real-time needs, especially on resource-limited edge devices. This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this problem. The ACNPU improves image quality by 0.34 dB over FSRCNN with a 27-layer model while requiring 36% less complexity and maintaining a similar model size, thanks to its decoupled asymmetric convolution and split-bypass structure. The hardware-friendly 17K-parameter model enables holistic model fusion instead of localized layer fusion, removing external DRAM access for intermediate feature maps. The on-chip memory bandwidth is further reduced with the input-stationary flow and parallel-layer execution, which lowers power consumption. The hardware is regular and easy to control, supporting different layers through processing element (PE) clusters with reconfigurable input and uniform data flow. The implementation in a 40 nm CMOS process uses 2333K gates and 198 KB of SRAM. The ACNPU achieves 31.7 FPS and 124.4 FPS for ×2 and ×4 Full-HD generation, respectively, and attains an energy efficiency of 4.75 TOPS/W.

Paper Structure (26 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Comparison of different SR models in terms of model size, multiply-accumulate (MAC) and peak signal-to-noise ratio (PSNR). The yellow dots are software works and the blue dots are hardware models. The orange dot is our ACNet model.
  • Figure 2: The proposed ACNet. The convolution notation, A×B, (a, b, c), is kernel size A×B, input channel a, output channel b, and group number c, respectively.
  • Figure 3: Parameters and operations detail of the CBB.
  • Figure 4: Parameters and operations of the overall model.
  • Figure 5: The proposed system architecture.
  • ...and 9 more figures
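The convolution notation in the Figure 2 caption, A×B, (a, b, c), maps directly onto a weight count for a grouped convolution. The sketch below is illustrative only: the channel width of 16 is hypothetical, biases are ignored, and treating an "asymmetric" pair as a 3×1 layer followed by a 1×3 layer is an assumption about the general technique, not the exact ACNet layer configuration, which this excerpt does not give.

```python
def conv_params(kh, kw, c_in, c_out, groups=1):
    """Weight count for a kh x kw convolution written as A×B, (a, b, c); bias excluded."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the group number")
    # Each output channel only sees c_in / groups input channels.
    return kh * kw * (c_in // groups) * c_out


# Hypothetical channel width, chosen only for illustration.
C = 16

full_3x3 = conv_params(3, 3, C, C)                             # 3×3, (16, 16, 1)
asym_pair = conv_params(3, 1, C, C) + conv_params(1, 3, C, C)  # 3×1 then 1×3
grouped_3x3 = conv_params(3, 3, C, C, groups=2)                # 3×3, (16, 16, 2)

print(full_3x3, asym_pair, grouped_3x3)
```

With stride 1, the same expression also gives the multiply-accumulate operations per output pixel, so in this toy setting the asymmetric pair costs two thirds of the full 3×3 layer in both parameters and MACs.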