Facial Expression Recognition System Using DNN Accelerator with Multi-threading on FPGA

Takuto Ando, Yusuke Inoue

TL;DR

This paper tackles real-time facial expression recognition (FER) on an embedded FPGA by running both face detection and FER on a single DPU through time-sharing and a multi-threaded pipeline. DenseBox-based face detection and a FER model trained on FER-2013 are offloaded to the DPU, while pre-processing runs on the CPU, yielding a compact, power-efficient on-board system. The system reaches 25 FPS throughput at roughly 9.3 FPS/W. Compared with Haar Cascade baselines, it improves detection accuracy by about 1.73x and substantially reduces detection latency, and it achieves 67.4% FER accuracy. The results demonstrate a practical strategy for dual-DNN execution on a single accelerator and inform design choices such as DPU size (B512) and operating frequency (≈400 MHz) for embedded real-time vision tasks.

Abstract

In this paper, we implement a stand-alone facial expression recognition system on an SoC FPGA with multi-threading using a Deep learning Processor Unit (DPU). The system consists of two steps: face detection and facial expression recognition. In the previous work, the Haar Cascade detector was run on a CPU in the face detection step due to FPGA resource limitations, but this detector is less accurate for profile faces and for images with variable illumination. Moreover, the previous work used a dedicated circuit accelerator, so running a second DNN inference for face detection on the FPGA would require adding a new accelerator. As an alternative, we run both DNN inferences on a DPU, which is a general-purpose CNN accelerator of the systolic array type. Running face detection with DenseBox and facial expression recognition with a CNN on the same DPU makes efficient use of FPGA resources while keeping the circuit size small. We also developed a multi-threading technique that improves overall throughput while increasing DPU utilization. With this approach, we achieved an overall system throughput of 25 FPS and a 2.4-fold improvement in throughput per unit of power consumption.
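The idea of two inference stages time-sharing one accelerator inside a multi-threaded pipeline can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: `detect_faces` and `classify_expression` are hypothetical stand-ins for the DenseBox and FER inferences, the lock `dpu_lock` models the single DPU serving one inference at a time, and the queue sizes are arbitrary.

```python
import queue
import threading

dpu_lock = threading.Lock()   # assumption: models one DPU, one inference at a time
_SENTINEL = object()          # marks the end of the frame stream


def detect_faces(frame):
    """Hypothetical stand-in for DenseBox face detection on the DPU."""
    with dpu_lock:
        return [(0, 0, 48, 48)]  # one dummy bounding box per frame


def classify_expression(frame, box):
    """Hypothetical stand-in for the FER CNN inference on the DPU."""
    with dpu_lock:
        return "neutral"


def detection_worker(frames, faces):
    # Stage 1: take pre-processed frames and run face detection.
    while True:
        item = frames.get()
        if item is _SENTINEL:
            faces.put(_SENTINEL)  # propagate shutdown to the next stage
            break
        idx, frame = item
        faces.put((idx, frame, detect_faces(frame)))


def recognition_worker(faces, results):
    # Stage 2: classify the expression of every detected face.
    while True:
        item = faces.get()
        if item is _SENTINEL:
            break
        idx, frame, boxes = item
        results.append((idx, [classify_expression(frame, b) for b in boxes]))


def run_pipeline(frames_in):
    frames = queue.Queue(maxsize=4)
    faces = queue.Queue(maxsize=4)
    results = []
    threads = [
        threading.Thread(target=detection_worker, args=(frames, faces)),
        threading.Thread(target=recognition_worker, args=(faces, results)),
    ]
    for t in threads:
        t.start()
    for idx, frame in enumerate(frames_in):
        frames.put((idx, frame))  # CPU-side pre-processing would happen here
    frames.put(_SENTINEL)
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline(["frame0", "frame1"])` returns one list of labels per frame, in frame order. While the recognition thread handles frame N, the detection thread can already work on frame N+1; the lock only serializes the accelerator calls themselves, so CPU-side work in one stage can overlap with DPU work in the other.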

Paper Structure

This paper contains 17 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: System configuration of proposed method
  • Figure 2: Sample images from FER-2013 dataset
  • Figure 3: Hardware architecture (DPU)
  • Figure 4: Process flow of face detection and facial expression recognition with a single thread
  • Figure 5: Process flow of face detection and facial expression recognition with multiple threads
  • ...and 2 more figures