Learning Quantized Continuous Controllers for Integer Hardware

Fabian Kresse, Christoph H. Lampert

TL;DR

This work addresses the challenge of running continuous-control reinforcement learning policies on resource-constrained integer hardware by applying quantization-aware training (QAT) to produce 2–3 bit policies that can be deployed as integer-only networks on an Artix-7 FPGA. The authors present a complete learning-to-hardware pipeline, combining a quantize-dequantize (QDQ) training approach with a FINN-based hardware synthesis flow, that automatically selects layer widths and bitwidths preserving FP32 performance on five MuJoCo tasks. With SAC and, in some cases, DDPG, the quantized policies reach FP32 parity at microsecond-scale inference latencies and microjoule-per-action energy consumption, and they are also more robust to input noise. The study outlines a three-step model-selection procedure that tailors policies to hardware constraints, and the resulting hardware efficiency compares favorably to a quantized reference.

Abstract

Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full-precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, comparing favorably to a quantized reference. Finally, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
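To make the idea of low-bit quantization concrete, the following is a minimal sketch of a uniform quantize-dequantize (QDQ) step in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the signed symmetric grid, the fixed `scale`, and the function names `to_int`, `qdq`, and `grid` are all hypothetical choices made for this example. During QAT a step of this kind would typically be applied to weights and activations in the forward pass, while the integer codes are what integer hardware would operate on.

```python
def to_int(x: float, bits: int, scale: float) -> int:
    """Map a real value to a signed integer code with `bits` bits (assumed scheme)."""
    qmin = -(2 ** (bits - 1))        # e.g. -2 for 2 bits
    qmax = 2 ** (bits - 1) - 1       # e.g. +1 for 2 bits
    q = round(x / scale)             # Python rounds exact ties to the even integer
    return max(qmin, min(qmax, q))   # clip to the representable range


def qdq(x: float, bits: int, scale: float) -> float:
    """Quantize, then dequantize: the value the network 'sees' during training."""
    return to_int(x, bits, scale) * scale


def grid(bits: int, scale: float) -> list[float]:
    """All real values representable at this bitwidth and scale."""
    return [q * scale for q in range(-(2 ** (bits - 1)), 2 ** (bits - 1))]


if __name__ == "__main__":
    # A 2-bit grid has only four levels, so most inputs are rounded or clipped.
    print(grid(2, 0.5))
    for value in (0.3, 5.0, -5.0):
        print(value, "->", to_int(value, 2, 0.5), "->", qdq(value, 2, 0.5))
```

With 2 bits only four levels are available, which illustrates why the choice of scale and of input precision matters so much at these bitwidths.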

Paper Structure

This paper contains 20 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Reward vs. bitwidth for full-precision (FP32) baselines (shaded region indicates one standard deviation) as well as four variants of network quantization: all: all network operations are quantized to the indicated bitwidths; input/output: only the quantization of the inputs/outputs is varied; core: the quantization of weights and internal activations is varied. In the latter three cases, all other components are left at 8-bit precision. We achieve FP32-parity with SAC and DDPG across most quantization scopes and environments. See main text for further discussion of the curves.
  • Figure 2: Evaluation reward across training time steps for our environments with SAC. Shaded bands show standard deviation over trained models. Overall, the selected quantized models show comparable convergence behavior to the floating-point baseline.
  • Figure 3: Robustness to observation input noise. Reward vs. noise level, $\sigma$, for floating-point and selected QAT policies on MuJoCo tasks. Shaded bands show standard deviation over trained models. The selected quantized model performs better than, or on par with, the FP32 baseline under noise injection.
  • Figure 4: Return vs. hidden width for SAC under the minimal FP32-matching core precision (2-bit except 3-bit for HalfCheetah/Humanoid). FP32 mean and its one-standard deviation band shown for reference.
  • Figure 5: Return vs. input quantization for SAC under the configuration from Table \ref{tab:final-config}, except input bits, which is swept here. FP32 mean and its one-standard deviation band shown for reference.
  • ...and 1 more figure
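
The noise-robustness experiment in Figure 3 perturbs the observations fed to the policy at a noise level $\sigma$. A minimal sketch of such a perturbation is given below. It assumes zero-mean Gaussian noise added independently to each observation entry, which the caption suggests but does not state; the function name `add_observation_noise` is hypothetical.

```python
import random


def add_observation_noise(obs: list[float], sigma: float, rng: random.Random) -> list[float]:
    """Return a copy of `obs` with independent N(0, sigma^2) noise on each entry.

    Assumption: additive Gaussian noise; the paper's exact noise model is not
    given in this excerpt.
    """
    return [o + rng.gauss(0.0, sigma) for o in obs]


if __name__ == "__main__":
    rng = random.Random(0)             # seeded for reproducibility
    observation = [0.1, -0.2, 0.3]
    for sigma in (0.0, 0.1, 0.5):      # sweep noise levels, as on the figure's x-axis
        print(sigma, add_observation_noise(observation, sigma, rng))
```

In an evaluation loop, the noisy observation would be passed to the policy in place of the clean one, and the episode return would be recorded for each value of `sigma`.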