DPUConfig: Optimizing ML Inference in FPGAs Using Reinforcement Learning

Alexandros Patras, Spyros Lalis, Christos D. Antonopoulos, Nikolaos Bellas

TL;DR

This work tackles efficient ML inference on FPGA MPSoCs by dynamically configuring Deep Learning Processor Units (DPUs) via a reinforcement learning agent. DPUConfig observes runtime telemetry and model features, selects among 26 DPU configurations, and reconfigures the FPGA to maximize energy efficiency under latency constraints; the agent is trained offline with PPO using context-aware rewards. On a Xilinx ZCU102 platform, it achieves roughly 95–97% of the optimal performance-per-watt (PPW) across varied models and workloads, with manageable reconfiguration overhead, and most cases meet the 30 FPS target. The results underscore the practicality of RL-driven runtime adaptation for energy-efficient ML inference on heterogeneous FPGA-based systems.

Abstract

Heterogeneous embedded systems, with diverse computing elements and accelerators such as FPGAs, offer a promising platform for fast and flexible ML inference, which is crucial for services such as autonomous driving and augmented reality, where delays can be costly. However, efficiently allocating computational resources for deep learning applications in FPGA-based systems is challenging. A Deep Learning Processor Unit (DPU) is a parameterizable FPGA-based accelerator module optimized for ML inference. It supports a wide range of ML models and can be instantiated multiple times within a single FPGA to enable concurrent execution. This paper introduces DPUConfig, a novel runtime management framework based on a custom Reinforcement Learning (RL) agent that dynamically selects DPU configurations, using real-time telemetry (system utilization, power consumption, and application performance) to inform its decisions. The experimental evaluation demonstrates that the RL agent achieves, on average, 95% of the optimal attainable energy efficiency for several CNN models on the Xilinx Zynq UltraScale+ MPSoC ZCU102.
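
To make the "fraction of the optimal energy efficiency" figure concrete, here is a minimal Python sketch of how a constrained optimum and a normalized score could be computed from profiled measurements. It assumes energy efficiency is measured as FPS per Watt and that the 30 FPS target acts as a hard feasibility threshold. The `Measurement` type, the `cfg_*` labels, the sample numbers, and the fallback rule are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One profiled DPU configuration (all values are illustrative)."""
    config: str    # hypothetical configuration label
    fps: float     # measured throughput in frames per second
    watts: float   # measured average power draw

    @property
    def ppw(self) -> float:
        # Energy efficiency expressed as FPS per Watt.
        return self.fps / self.watts


def best_config(measurements, min_fps=30.0):
    """Return the highest-PPW measurement that meets the FPS target.

    Falls back to the fastest configuration when none meets the target;
    this fallback is an assumption of the sketch.
    """
    feasible = [m for m in measurements if m.fps >= min_fps]
    if feasible:
        return max(feasible, key=lambda m: m.ppw)
    return max(measurements, key=lambda m: m.fps)


def normalized_ppw(chosen, measurements, min_fps=30.0):
    """PPW of the chosen configuration relative to the constrained optimum."""
    return chosen.ppw / best_config(measurements, min_fps).ppw


profile = [
    Measurement("cfg_a", fps=25.0, watts=4.0),   # PPW 6.25, but below 30 FPS
    Measurement("cfg_b", fps=40.0, watts=8.0),   # PPW 5.0
    Measurement("cfg_c", fps=60.0, watts=15.0),  # PPW 4.0
]
optimal = best_config(profile)
print(optimal.config, round(normalized_ppw(profile[2], profile), 2))
# prints: cfg_b 0.8
```

In this toy profile the most efficient configuration overall (`cfg_a`) is rejected because it misses the frame-rate target, so an agent that picked `cfg_c` would score 80% of the constrained optimum.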

Paper Structure

This paper contains 13 sections, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: The optimal execution target depends on ML characteristics. The bars (left axis) show energy efficiency in FPS per Watt, and the red points (right axis) indicate performance.
  • Figure 2: PPW (left axis, bars) and performance in FPS (right axis, points) across different DPU configurations under three system states. The dark bars highlight the configuration achieving the best energy efficiency while maintaining performance above 30 FPS.
  • Figure 3: PPW (left axis, bars) and accuracy (right axis, lines) across different DPU configurations under the N state. For example, the accuracy of ResNet152 when 25% of its channels are eliminated is 66.64%.
  • Figure 4: High-level design of the DPUConfig framework.
  • Figure 5: Normalized PPW results of DPUConfig across two workload states (C, M). Model Abbr.: RegX = RegNetX, Inc3 = InceptionV3, R152 = ResNet152. PR0, PR25, PR50 denote pruning ratios of 0%, 25%, and 50%, respectively.
  • ...and 1 more figure