Compressing Human Body Video with Interactive Semantics: A Generative Approach

Bolin Chen, Shanzhi Yin, Hanwei Zhu, Lingyu Zhu, Zihan Zhang, Jie Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye

TL;DR

This work tackles ultra-low-bitrate human body video compression by introducing Interactive Human Video Coding (IHVC), which embeds semantic-level representations into the bitstream so that the reconstructed signal can be manipulated directly. The approach combines a VVC-coded texture reference with a compact 31‑D semantic vector derived from a 3D human model (OSX), a mesh-based motion estimation module, and GAN-based frame synthesis to produce high-quality reconstructions with controllable head and body poses. Key contributions include a 31‑D interoperable semantic encoding scheme, a template-based 3D mesh reconstruction pipeline, a SPADE-based motion estimation network that predicts dense flow and occlusion fields, and end-to-end optimization using perceptual and distortion losses. The framework demonstrates competitive rate–distortion performance at ultra-low bitrates and enables interactive editing of facial and body semantics, which makes it relevant to metaverse-style digital human communication and other interactive video applications.
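To give a sense of why a 31‑D vector per frame counts as compact, the following minimal sketch computes the raw, uncompressed bitrate of sending one such vector for every frame. The frame rate and bits-per-value defaults are hypothetical choices made only for illustration; they are not taken from the paper, and the actual codec would further reduce the rate through prediction and entropy coding.

```python
def raw_semantic_bitrate_kbps(dims: int = 31,
                              bits_per_value: int = 16,
                              fps: float = 25.0) -> float:
    """Uncompressed bitrate (kbps) of one `dims`-D vector per frame.

    Only `dims=31` comes from the summary above; `bits_per_value`
    and `fps` are assumed values for illustration.
    """
    bits_per_second = dims * bits_per_value * fps
    return bits_per_second / 1000.0


if __name__ == "__main__":
    # Raw payload under the assumed 16-bit, 25 fps setting.
    print(raw_semantic_bitrate_kbps())
    # A coarser, equally hypothetical 8-bit quantization halves it.
    print(raw_semantic_bitrate_kbps(bits_per_value=8))
```

Under these assumptions the raw payload is an upper bound on the semantic stream; the texture reference frame is coded separately and is not counted here.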

Abstract

In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve the mesh-based motion fields from these decoded semantics to realize the high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
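As a rough illustration of how a decoder can use motion fields, the sketch below backward-warps a reference frame with a dense flow field and then attenuates it with an occlusion map, which is a common pattern in generative animation models. It is not the paper's network: the function name, the nearest-neighbour sampling, the pixel-unit flow convention, and the list-based grayscale image are all simplifying assumptions.

```python
def warp_with_occlusion(reference, flow, occlusion):
    """Backward-warp a grayscale frame and mask it by an occlusion map.

    reference: H x W nested list of pixel values (assumed layout).
    flow:      H x W nested list of (dx, dy) offsets in pixels; each
               output pixel (x, y) samples reference at (x+dx, y+dy).
    occlusion: H x W nested list of weights in [0, 1]; 0 marks regions
               the warp cannot explain (left for a generator to fill).
    """
    h, w = len(reference), len(reference[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour sampling, clamped to the frame border.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            out[y][x] = reference[sy][sx] * occlusion[y][x]
    return out


if __name__ == "__main__":
    ref = [[0, 1, 2], [3, 4, 5]]
    shift_right = [[(1, 0)] * 3 for _ in range(2)]  # sample one pixel to the right
    visible = [[1.0] * 3 for _ in range(2)]
    print(warp_with_occlusion(ref, shift_right, visible))
```

In a learned system the flow and occlusion maps would be predicted by a network and applied to feature maps rather than raw pixels; the sketch only shows the sampling-then-masking order.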

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures.

Figures (5)

  • Figure 1: Overview of the proposed IHVC framework towards low-bandwidth and enhanced-interactivity human body video communication.
  • Figure 2: Overall RD performance comparisons with VVC, MRAA, TPSM, CFTE and LIA in terms of DISTS and LPIPS.
  • Figure 3: Illustration of 15 testing sequences selected and pre-processed from the TED-Talk dataset.
  • Figure 4: Visual quality comparisons among VVC, MRAA, TPSM, CFTE, LIA and IHVC (Proposed) at similar bit rates.
  • Figure 5: Illustration of interactive human body coding regarding different semantics on head posture, body posture and body location.