Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation

Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Biqian Feng, Wenjun Zhang, Jihong Park, Tony Quek

TL;DR

This paper presents WVSC-D, a semantic-level wireless video transmission framework that combines deep semantic video coding with a decoupled diffusion multi-frame compensation mechanism. By transmitting a reference semantic I frame and residual semantic P frames, and by polishing P frames at the receiver through GMFC and the decoupled diffusion process (DDMFC), the approach achieves notable bitrate savings and robust performance under wireless channel impairments. Key contributions include the semantic I/P frame scheme, GMFC for generation-based compensation, and the decoupled diffusion architecture that shares base noise across a GoP while generating frame-specific residuals; together they yield improvements over state-of-the-art DL-based and traditional schemes in PSNR and perceptual metrics. The proposed method demonstrates practical impact for low-latency wireless video transmission, edge computing, and IoT scenarios, with potential extension to multi-modal data.

Abstract

Existing wireless video transmission schemes perform video coding directly at the pixel level and neglect the semantics contained in videos. In this paper, we propose a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D, which brings the idea of semantic communication into wireless video transmission. WVSC-D first encodes the original video frames as semantic frames and then performs video coding on these compact representations, so that coding takes place at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to replace the per-frame motion vectors used in common video coding methods. At the receiver, DDMFC generates the compensated current semantic frame through a two-stage conditional diffusion process. With both reference frame transmission and DDMFC frame compensation, bandwidth efficiency improves while video transmission performance remains satisfactory. Experimental results verify the performance gain of WVSC-D over other DL-based methods, e.g., about 1.8 dB in PSNR over DVSC.
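
The reference/residual split described above can be illustrated with a minimal NumPy sketch. It assumes the semantic frames are already available as feature arrays; the names `encode_gop`, `awgn`, and `decode_gop`, the array shapes, and the AWGN channel model are hypothetical and not taken from the paper. In particular, the receiver here simply adds each residual back to the reference, as a stand-in for the diffusion-based compensation (DDMFC) that the paper actually proposes.

```python
import numpy as np

def encode_gop(semantic_frames):
    """Split a GoP into one reference (I) frame and residual (P) frames."""
    ref = semantic_frames[0]
    residuals = [f - ref for f in semantic_frames[1:]]
    return ref, residuals

def awgn(x, snr_db, rng):
    """Toy channel (an assumption): add white Gaussian noise at a given SNR."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def decode_gop(ref_hat, residuals_hat):
    """Placeholder compensation: add each received residual to the reference."""
    return [ref_hat] + [ref_hat + r for r in residuals_hat]

rng = np.random.default_rng(0)
# Hypothetical GoP of 4 semantic frames, each with shape (channels, height, width).
gop = [rng.standard_normal((4, 8, 8)) for _ in range(4)]

ref, residuals = encode_gop(gop)
recovered = decode_gop(ref, residuals)          # noiseless round trip
noisy_ref = awgn(ref, snr_db=10.0, rng=rng)     # reference frame after the toy channel
```

Without channel noise the round trip is lossless; any benefit of learned compensation shows up only once the reference and residuals are corrupted by the channel.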

Paper Structure

This paper contains 27 sections, 31 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: A comparison of video compression and wireless video communication methods: (a) pixel-level video compression without communication (DVC, MLVC, ALVC); (b) pixel-level wireless video communication (DVSC); (c) semantic-level wireless video communication (WVSC); (d) the proposed diffusion-based semantic-level wireless video communication.
  • Figure 2: (a) The proposed WVSC-D framework. The video is transmitted as a series of GoPs, each divided into semantic I frame transmission (red lines) and semantic P frame transmission (blue lines). (b) The proposed DDMFC module. The received semantic I frames $\hat{\mathbf{f}}^\mathrm{ref}$ are compensated into current semantic P frames $\tilde{\mathbf{f}}^i$ through conditional diffusion generation.
  • Figure 3: The structure of the motion estimation and compensation network.
  • Figure 4: (a) The diffusion process for the semantic I frame. (b) The proposed DDMFC module. The semantic I frame shares both its base frame and its base noise with the semantic P frames in a GoP, while each semantic P frame produces its own residual noise and multi-frame condition steering to generate the corresponding semantic frame at the receiver. $\mu_{\epsilon}$ denotes the mean prediction function of DDIM in the $\epsilon$-prediction formulation.
  • Figure 5: The architecture of the diffusion network. (a) For the semantic I frame: U-Net (base noise generator). (b) For the semantic P frame: U-Net with MFA module (residual noise generator).
  • ...and 5 more figures
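
The DDIM mean prediction in the $\epsilon$-prediction formulation mentioned in the Figure 4 caption can be sketched as below. This is the standard deterministic DDIM update, not the paper's exact $\mu_{\epsilon}$. Combining the shared base noise and the frame-specific residual noise by simple addition is an assumption made here for illustration, and the names `ddim_step` and `decoupled_eps` are hypothetical.

```python
import numpy as np

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) given a noise estimate eps."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the previous (less noisy) timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps

def decoupled_eps(base_eps, residual_eps):
    """Assumed combination: shared base noise plus frame-specific residual noise."""
    return base_eps + residual_eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))                 # hypothetical clean semantic P frame
base_eps = rng.standard_normal(x0.shape)            # noise shared across the GoP
residual_eps = 0.1 * rng.standard_normal(x0.shape)  # noise unique to this frame
eps = decoupled_eps(base_eps, residual_eps)

alpha_bar_t = 0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# With the true noise and alpha_bar_prev = 1, a single step recovers x0.
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```

In a real sampler the two noise terms would come from trained networks (the base and residual noise generators of Figure 5) rather than from a random generator, and the update would be iterated over a schedule of timesteps.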