Video Semantic Communication with Major Object Extraction and Contextual Video Encoding

Haopeng Li, Haonan Tong, Sihua Wang, Nuocheng Yang, Zhaohui Yang, Changchuan Yin

TL;DR

This paper addresses the challenge of transmitting massive video data over bandwidth-limited wireless channels for interactive applications. It introduces a video semantic communication framework that separates high-level semantic extraction of the major object from low-level contextual encoding, which substantially reduces the transmitted data while preserving reconstruction quality. The major object extraction (MOE) module detects and isolates the major foreground object and discards the redundant background. The contextual video encoding (CVE) module then encodes the temporal and spatial context of the foreground in a latent domain using CNN-based motion estimation, contextual extraction, and entropy coding, paired with an end-to-end trainable semantic decoder. Key innovations include a MobileNetV3-Large-based MOE, a CNN-driven four-level optical-flow motion estimator in CVE, a Laplacian-based entropy model with rANS coding, and a custom training algorithm that optimizes the MOE and CVE modules separately before joint fine-tuning. Experiments on MOEdataset and Vimeo-90k show reductions in transmitted data of up to 25% and PSNR improvements of up to 14% over traditional codecs, along with improved robustness to channel impairments, highlighting the practical value for ultra-dense video transmission in future networks.
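
To make the Laplacian-based entropy model more concrete, the following is a minimal sketch of how such a model can estimate the bit cost of quantized latent symbols before an entropy coder such as rANS is applied. The discretized likelihood (probability mass on the interval of width 1 around each integer symbol) is a common choice in learned compression and is an assumption here, not a detail confirmed by this summary. The function names `laplace_cdf`, `symbol_bits`, and `estimate_bits` are hypothetical.

```python
import math


def laplace_cdf(x, mu, b):
    """CDF of a Laplace distribution with location mu and scale b > 0."""
    z = (x - mu) / b
    if z < 0:
        return 0.5 * math.exp(z)
    return 1.0 - 0.5 * math.exp(-z)


def symbol_bits(y, mu, b, eps=1e-9):
    """Estimated bits for an integer symbol y.

    Assumption: the symbol's probability is the Laplace mass on
    [y - 0.5, y + 0.5], and its ideal code length is -log2 of that mass.
    """
    p = laplace_cdf(y + 0.5, mu, b) - laplace_cdf(y - 0.5, mu, b)
    return -math.log2(max(p, eps))  # eps guards against log2(0)


def estimate_bits(latents, mus, scales):
    """Sum the estimated bit cost over a sequence of quantized latents."""
    return sum(symbol_bits(y, m, b) for y, m, b in zip(latents, mus, scales))


if __name__ == "__main__":
    # Symbols near the predicted mean are cheap; outliers cost more bits.
    print(symbol_bits(0, 0.0, 1.0))
    print(symbol_bits(3, 0.0, 1.0))
    print(estimate_bits([0, 1, -2], [0.0, 0.5, 0.0], [1.0, 1.0, 2.0]))
```

An entropy coder that is given the same per-symbol probabilities can approach this estimated total, which is why a model that predicts the mean and scale accurately lowers the amount of data to transmit.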

Abstract

This paper studies an end-to-end video semantic communication system for massive communication. In the considered system, the transmitter must continuously send video to the receiver to facilitate character reconstruction in immersive applications, such as interactive video conferencing. However, transmitting the original video, with its substantial amount of data, strains the limited wireless resources. To address this issue, we reduce the amount of data transmitted by having the transmitter extract and send only the semantic information of the video, which captures the major object and the temporal and spatial correlations in the video. Specifically, we first develop a video semantic communication system based on major object extraction (MOE) and contextual video encoding (CVE) to achieve efficient video transmission. Then, we design the MOE and CVE modules with convolutional neural network (CNN)-based motion estimation, contextual extraction, and entropy coding. Simulation results show that, compared with traditional coding schemes, the proposed method can reduce the amount of transmitted data by up to 25% while increasing the peak signal-to-noise ratio (PSNR) of the reconstructed video by up to 14%.
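
For readers unfamiliar with the quality metric quoted above, the sketch below shows the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, applied to flattened pixel sequences. It assumes 8-bit pixels (peak value 255) and uses a hypothetical function name `psnr`; it illustrates the metric only and is not the paper's evaluation code.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixels are given as flat sequences of numbers in
    [0, max_value]; 255 corresponds to 8-bit video.
    """
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: distortion-free reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 100, 100, 100]
    decoded = [110, 90, 110, 90]  # every pixel is off by 10, so MSE = 100
    print(psnr(original, decoded))
```

Because PSNR is logarithmic, a higher value means lower mean squared error between the reconstructed and original frames.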

Paper Structure

This paper contains 16 sections, 13 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: The overall flowchart of the proposed MOE-CVE scheme.
  • Figure 2: The structure of MOE, where Conv Layer denotes a convolutional layer parameterized as $|\text{kernel} \times \text{stride} \times \text{padding}|$, and Resblock denotes a residual block.
  • Figure 3: Optical flow estimation network.
  • Figure 4: The structure of contextual extraction.
  • Figure 5: The structure of entropy coding.
  • ...and 4 more figures