Lightweight Embedded FPGA Deployment of Learned Image Compression with Knowledge Distillation and Hybrid Quantization
Alaa Mazouz, Sumanta Chaudhuri, Marco Cagnazzo, Mihai Mitrea, Enzo Tartaglione, Attilio Fiandrotti
TL;DR
This work tackles hardware-efficient deployment of learned image compression (LIC) on embedded FPGAs by shifting the design burden to model dimensioning and training-time strategies. It introduces a knowledge-distillation (KD) framework to produce lean student LIC models, a hardware-friendly GDN core, channel pruning, and mixed-precision quantization, all assembled into a fully pipelined FPGA deployment. The approach yields notable rate-distortion (RD) efficiency gains and real-time 720p throughput on a ZCU102, outperforming prior FPGA LIC methods in RD efficiency and latency while maintaining energy efficiency. An analytical FPS model guides design decisions, and extensive ablations show how KD, GDN, quantization, pruning, and pipelining each contribute to overall performance. This work advances practical LIC deployment for edge devices, enabling higher-quality compression under strict hardware constraints.
Abstract
Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in rate-distortion (RD) efficiency, prompting research into hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency over RD efficiency, and do so through an extensive exploration of the hardware design space. We present a novel design paradigm in which the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning, without compromising RD efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameter, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even after parameter quantization. Third, we design a pipelined FPGA configuration that takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state-of-the-art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model.
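For readers unfamiliar with the GDN activation mentioned above, the following is a minimal sketch of GDN as it is commonly defined in the LIC literature: each channel value x_i is divided by sqrt(beta_i + sum_j gamma_ij * x_j^2). It is a floating-point reference for a single spatial position, not the hardware-friendly implementation proposed in this work, whose details are not given in the abstract. The function name `gdn` and the argument layout are assumptions made for illustration.

```python
import math


def gdn(x, beta, gamma):
    """Apply GDN across channels at one spatial position (illustrative only).

    x:     list of C channel values
    beta:  list of C positive offsets
    gamma: C x C nested list of non-negative weights
    """
    out = []
    for i, xi in enumerate(x):
        # The denominator pools the squared responses of all channels j.
        pooled = beta[i] + sum(gamma[i][j] * xj * xj for j, xj in enumerate(x))
        out.append(xi / math.sqrt(pooled))
    return out


# Two channels with equal responses: the pooled term is 7 + 0.5*9 + 0.5*9 = 16.
print(gdn([3.0, 3.0], [7.0, 7.0], [[0.5, 0.5], [0.5, 0.5]]))  # [0.75, 0.75]
```

The square root and the division are the operations that tend to be costly in fixed-point hardware, which is one reason a hardware-friendly variant is of interest.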
