Lightweight Embedded FPGA Deployment of Learned Image Compression with Knowledge Distillation and Hybrid Quantization

Alaa Mazouz, Sumanta Chaudhuri, Marco Cagnazzo, Mihai Mitrea, Enzo Tartaglione, Attilio Fiandrotti

TL;DR

This work tackles hardware-efficient LIC deployment on embedded FPGA by shifting the design burden to model dimensioning and training-time strategies. It introduces a knowledge-distillation framework to produce lean student LICs, a hardware-friendly GDN core, channel pruning, and mixed-precision quantization, all assembled into a fully pipelined FPGA deployment. The approach yields notable RD-efficiency gains and real-time 720p throughput on a ZCU102, outperforming prior FPGA LIC methods in RD and latency while maintaining energy efficiency. The analytical FPS model guides design decisions, and extensive ablations demonstrate the contributions of KD, GDN, quantization, pruning, and pipelining to the overall performance. This work advances practical LIC deployment for edge devices, enabling higher-quality compression under strict hardware constraints.

Abstract

Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in RD efficiency, prompting research into hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency over RD efficiency and rely on an extensive exploration of the hardware design space. We present a novel design paradigm in which the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning, without compromising RD efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameter, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even after parameter quantization. Third, we design a pipelined FPGA configuration that takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state-of-the-art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model.

Paper Structure

This paper contains 24 sections, 14 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: The proposed workflow for training, optimizing, distilling, and deploying LIC models on hardware is comprehensive yet abstracts hardware-specific compilation through Xilinx VITIS-AI APIs
  • Figure 2: Hyper-prior models capture broader image features, while the context entropy model looks at the already decoded neighboring pixels (the causal context) and predicts the distribution of the next pixel based on that context
  • Figure 3: Custom GDN/iGDN core integration (red: GDN pipeline, blue: iGDN pipeline), built from a Square Unit, Multiply Unit, Add Unit and Division Unit
  • Figure 4: Registering the custom GDN core with the XIR
  • Figure 5: The model is iteratively pruned by 10% of its filters per iteration over three iterations until 30% sparsity is reached, with fine-tuning restoring the lost RD efficiency.
  • ...and 12 more figures
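
For readers unfamiliar with the GDN activation named in the abstract and in the Figure 3 caption, the following is a minimal sketch of the standard GDN definition from the LIC literature, y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2), applied across the channels of a single pixel. It is not the paper's FPGA core: the function name `gdn`, its arguments, and the plain-list data layout are illustrative assumptions, and the square-root step comes from the textbook definition (the Figure 3 caption lists only square, multiply, add and division units, so the hardware core may treat that step differently). The inverse variant (iGDN) multiplies by the same normalization term instead of dividing.

```python
import math


def gdn(x, beta, gamma, inverse=False):
    """Standard GDN (or iGDN when inverse=True) over the channels of one pixel.

    x:     list of C channel values at one spatial position
    beta:  list of C positive biases (hypothetical values in the example)
    gamma: C x C nested list of non-negative weights
    """
    # Square step: x_j^2 for every channel.
    squares = [v * v for v in x]
    out = []
    for i, v in enumerate(x):
        # Multiply and add steps: beta_i + sum_j gamma_ij * x_j^2.
        pooled = beta[i] + sum(g * s for g, s in zip(gamma[i], squares))
        # Square root as in the textbook definition (assumption for any
        # hardware variant, which may approximate or drop this step).
        norm = math.sqrt(pooled)
        # Division step for GDN, multiplication for iGDN.
        out.append(v * norm if inverse else v / norm)
    return out


if __name__ == "__main__":
    x = [2.0, 2.0]
    beta = [1.0, 1.0]
    gamma = [[1.0, 1.0], [1.0, 1.0]]
    print(gdn(x, beta, gamma))                # each channel divided by 3
    print(gdn(x, beta, gamma, inverse=True))  # each channel multiplied by 3
```

In trained models the GDN and iGDN layers carry separately learned beta and gamma parameters, so applying one after the other with shared parameters, as in the toy call above, is not expected to reproduce the input exactly.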