GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow

Rui Wan, Qi Zheng, Ruoyu Zhang, Bu Chen, Jiaming Liu, Min Li, Minge Jing, Jinjia Zhou, Yibo Fan

TL;DR

The paper addresses edge deployment of animation-based generative codecs (AGC) by proposing a hardware-software co-design on FPGAs that combines layer fusion and post-training static quantization. It introduces an overlapped accelerator with AGC-specific PEs (e.g., grid_sample, upsampling) and a tile-based dataflow to decode AGC streams efficiently on resource-constrained devices. Experiments on the VoxCeleb and Xiph datasets show that AGC achieves better perceptual quality at ultra-low bitrates, and that the FPGA implementation is substantially more energy-efficient than CPU and GPU baselines. A prototype on the PYNQ-Z1 demonstrates that FPGA-based AGC decoding is feasible for energy-efficient edge video services in talking-face communications.

Abstract

The Animation-based Generative Codec (AGC) is an emerging paradigm for talking-face video compression. However, deploying its intricate decoder on resource- and power-constrained edge devices is challenging because of its large parameter count, the difficulty of adapting to dynamically evolving algorithms, and the high power consumption induced by extensive computation and data transmission. This paper proposes, for the first time, a field-programmable gate array (FPGA)-oriented AGC deployment scheme for edge-computing video services. First, we analyze the AGC algorithm and apply network compression methods, including post-training static quantization and layer fusion. Next, we design an overlapped accelerator that uses the co-processor paradigm to perform computations through software-hardware co-design. The hardware processing unit comprises engines for convolution, grid sampling, upsampling, and other operations. Parallelization strategies such as double-buffered pipelines and loop unrolling are employed to fully exploit the FPGA's resources. Finally, we build an AGC FPGA prototype on the PYNQ-Z1 platform using the proposed scheme, achieving 24.9$\times$ and 4.1$\times$ higher energy efficiency than a commercial Central Processing Unit (CPU) and Graphics Processing Unit (GPU), respectively. Specifically, the FPGA system requires only 11.7 microjoules ($\mu$J) per reconstructed pixel.
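The two compression steps named in the abstract, layer fusion and post-training static quantization, can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that fusion means folding a batch-normalization layer into the preceding layer's weights, and that quantization is symmetric 8-bit with a scale fixed offline from calibration samples. All function names are hypothetical.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (z - mean) / sqrt(var + eps) + beta into the layer z = W x + b.

    weights[o] is the flat weight list of output channel o (assumed layout).
    Returns fused weights and biases so that one layer replaces two.
    """
    fused_w, fused_b = [], []
    for o, w_row in enumerate(weights):
        s = gamma[o] / math.sqrt(var[o] + eps)  # per-channel BN scale
        fused_w.append([w * s for w in w_row])
        fused_b.append((bias[o] - mean[o]) * s + beta[o])
    return fused_w, fused_b


def calibrate_scale(samples, num_bits=8):
    """Static scale: computed once from calibration data, then frozen (assumed symmetric)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in samples)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to signed integers, clamping to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in q_values]
```

Fusing before quantizing means only one set of weights per layer has to be calibrated and stored, which is one plausible reason the two techniques are paired in a deployment flow.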

Paper Structure

This paper contains 27 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: The AGC model and decoding system on FPGA platform. (a) is the AGC pipeline. The source frame $F_0$ and extracted keypoints of frames $F_t$ are encoded by BPG and entropy coding, respectively. The decoder estimates the sparse motion between $F_0$ and $F_t$ first. Then the dense motion is predicted by a neural network and $\hat{F_t}$ is reconstructed by a network finally. (b) is the high-level diagram of the FPGA system on which the decoder is deployed. PS and PL serve as preprocessor and neural network accelerator, respectively.
  • Figure 2: The detailed architecture of the Dense Motion Network and frame generation. Here, we take the example of an input frame resolution of 256$\times$256 and a keypoint count of 10. Each background color matches the color of the corresponding module in Fig. 1(a).
  • Figure 3: The overall deployment flow of AGC.
  • Figure 4: SoC-level diagram. The SoC is divided into two distinct subsystems: the Processing System (PS) and the Programmable Logic (PL).
  • Figure 5: An example of off-on-off-chip tile-based convolution dataflow. Data movement marked with "*" keeps happening when the output tile stores partial sums. Data movement marked with "**" only happens when all the tiles along channel_in dimension are computed.
  • ...and 7 more figures
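The tile-based convolution dataflow described in the Figure 5 caption can be sketched in software. This is only an illustration under stated assumptions: the figure itself is not available here, so the sketch assumes that input tiles are loaded once per channel_in tile while an on-chip buffer holds partial sums, and that the output tile is written back off-chip only after all channel_in tiles are accumulated. The function names, tile sizes, and the `stats` counters are hypothetical.

```python
def conv2d_reference(x, w):
    """Direct 'valid' convolution. Layout: x[c][h][w], w[o][c][kh][kw]."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    oh, ow = h - k + 1, wd - k + 1
    y = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]
    for o in range(c_out):
        for c in range(c_in):
            for i in range(oh):
                for j in range(ow):
                    for di in range(k):
                        for dj in range(k):
                            y[o][i][j] += x[c][i + di][j + dj] * w[o][c][di][dj]
    return y


def conv2d_tiled(x, w, tile_hw=2, tile_cin=2):
    """Same convolution, tiled over output space and over channel_in."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    oh, ow = h - k + 1, wd - k + 1
    y = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]  # "off-chip" output
    stats = {"input_tile_loads": 0, "output_tile_stores": 0}
    for o in range(c_out):
        for i0 in range(0, oh, tile_hw):
            for j0 in range(0, ow, tile_hw):
                th, tw = min(tile_hw, oh - i0), min(tile_hw, ow - j0)
                acc = [[0.0] * tw for _ in range(th)]  # "on-chip" partial sums
                for c0 in range(0, c_in, tile_cin):
                    # Repeats for every channel_in tile while acc holds partial sums.
                    stats["input_tile_loads"] += 1
                    for c in range(c0, min(c0 + tile_cin, c_in)):
                        for i in range(th):
                            for j in range(tw):
                                for di in range(k):
                                    for dj in range(k):
                                        acc[i][j] += (x[c][i0 + i + di][j0 + j + dj]
                                                      * w[o][c][di][dj])
                # Happens once per output tile, after all channel_in tiles are done.
                stats["output_tile_stores"] += 1
                for i in range(th):
                    for j in range(tw):
                        y[o][i0 + i][j0 + j] = acc[i][j]
    return y, stats
```

Comparing `conv2d_tiled` against `conv2d_reference` on a small input confirms that the tiling changes only the order of data movement, not the result; the counters show that output stores occur far less often than input loads.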