Compressing Human Body Video with Interactive Semantics: A Generative Approach
Bolin Chen, Shanzhi Yin, Hanwei Zhu, Lingyu Zhu, Zihan Zhang, Jie Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye
TL;DR
This work tackles ultra-low-bitrate compression of human body video by introducing Interactive Human Video Coding (IHVC), which embeds semantic-level representations into the bitstream so that the reconstructed signal can be manipulated directly. The approach combines a VVC-coded texture reference with a compact 31‑D semantic vector derived from a 3D human model (OSX), a mesh-based motion estimation module, and GAN-based frame synthesis to produce high-quality reconstructions with controllable head and body poses. Key contributions include a 31‑D interoperable semantic encoding scheme, a template-based 3D mesh reconstruction pipeline, a SPADE-based motion estimation network that predicts dense flow and occlusion fields, and end-to-end optimization with perceptual and distortion losses. The framework demonstrates competitive rate–distortion performance at ultra-low bitrates and enables interactive editing of facial and body semantics, which makes it a candidate for metaverse-style digital human communication and other interactive video applications.
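To make the role of the dense flow and occlusion fields concrete, the following is a minimal sketch of how such fields are commonly applied to a reference frame: each output pixel samples the reference at a flow-displaced position, and the occlusion map attenuates regions the warp cannot explain. This is not the authors' implementation. The function name `warp_with_flow`, the `(dx, dy)` flow layout, the grayscale list-of-rows frame format, and the pixel-level (rather than feature-level) warping are all assumptions made for illustration.

```python
import math


def warp_with_flow(reference, flow, occlusion):
    """Backward-warp a grayscale frame (list of rows) with a dense flow field.

    Assumed layout (hypothetical, for illustration only):
      flow[y][x]      -> (dx, dy) offset telling output pixel (x, y) where
                         to sample in the reference frame
      occlusion[y][x] -> weight in [0, 1]; 0 marks a fully occluded pixel
    """
    h, w = len(reference), len(reference[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the image bounds.
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring pixels.
            top = reference[y0][x0] * (1 - wx) + reference[y0][x1] * wx
            bottom = reference[y1][x0] * (1 - wx) + reference[y1][x1] * wx
            # Attenuate pixels that the occlusion map marks as unreliable.
            out[y][x] = (top * (1 - wy) + bottom * wy) * occlusion[y][x]
    return out


if __name__ == "__main__":
    ref = [[float(10 * y + x) for x in range(4)] for y in range(3)]
    shift_right = [[(1.0, 0.0)] * 4 for _ in range(3)]
    visible = [[1.0] * 4 for _ in range(3)]
    for row in warp_with_flow(ref, shift_right, visible):
        print(row)
```

In a generative codec of this kind, the occluded regions would be filled in by the synthesis network instead of being left dark; the sketch stops at the warping step.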
Abstract
In this paper, we propose to compress human body video with interactive semantics, making video coding interactive and controllable through manipulation of the semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle the nonlinear dynamics and complex motion of the human body signal into a series of configurable embeddings, which can be controllably edited, compactly compressed, and efficiently transmitted. The proposed decoder, in turn, evolves mesh-based motion fields from the decoded semantics to achieve high-quality human body video reconstruction. Experimental results show that the proposed framework achieves promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre- or post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
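The idea of embeddings that are "edited, compressed, and transmitted" can be illustrated with a toy pipeline: quantize each per-frame semantic vector, send only the change from the previous frame, and let the receiver apply an edit to the decoded values before synthesis. This is a sketch under stated assumptions, not the coding scheme of the paper. The uniform step size, the inter-frame residual coding, the `edit` hook, and the choice of slot 0 as a pose parameter are all hypothetical.

```python
def quantize(vector, step=0.01):
    """Uniformly quantize a semantic vector to integer symbols."""
    return [round(v / step) for v in vector]


def dequantize(symbols, step=0.01):
    """Map integer symbols back to approximate parameter values."""
    return [s * step for s in symbols]


def encode_sequence(frames, step=0.01):
    """Code each frame as the residual against the previous quantized frame.

    The first frame is coded against an all-zero predictor.
    """
    prev = [0] * len(frames[0])
    residuals = []
    for frame in frames:
        q = quantize(frame, step)
        residuals.append([a - b for a, b in zip(q, prev)])
        prev = q
    return residuals


def decode_sequence(residuals, step=0.01, edit=None):
    """Rebuild the vectors; `edit` is an optional user-supplied manipulation."""
    prev = [0] * len(residuals[0])
    decoded = []
    for residual in residuals:
        prev = [a + b for a, b in zip(prev, residual)]
        vector = dequantize(prev, step)
        if edit is not None:
            # Interaction happens on the semantics, not on the pixels.
            vector = edit(vector)
        decoded.append(vector)
    return decoded


if __name__ == "__main__":
    # Three parameters stand in for the full semantic vector.
    frames = [[0.10, -0.20, 0.05], [0.12, -0.20, 0.07]]
    residuals = encode_sequence(frames)
    print(residuals)

    # Hypothetical edit: offset slot 0 (imagined here as a pose angle).
    def turn(vector):
        return [vector[0] + 0.5] + vector[1:]

    print(decode_sequence(residuals, edit=turn))
```

Because consecutive frames of a talking or moving person change slowly, the residuals are small integers, which is what makes a low-dimensional semantic stream cheap to entropy-code. The edit touches only the decoded parameters, so no pixel-domain pre- or post-processing is involved.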
