Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression

Xihua Sheng; Li Li; Dong Liu; Houqiang Li

TL;DR

This work addresses two obstacles to accurate inter prediction in learned video compression: local motion inconsistency and occlusion. It introduces structure and detail decomposition (SDD) to model consistent and inconsistent motion separately, and a long short-term temporal context fusion strategy that combines ConvLSTM-based long-term contexts with short-term, SDD-derived contexts. The codec uses joint MV encoding for the structure and detail components, SDD-based temporal context mining, and a multi-faceted entropy model to improve prediction accuracy. It reports BD-rate savings against VTM of approximately 13.4% in terms of PSNR and 44.1% in terms of MS-SSIM across multiple test datasets. The results indicate improved inter prediction quality at manageable complexity, which suggests the approach is practical for future learned video codecs.

Abstract

Video compression performance is closely related to the accuracy of inter prediction. It tends to be difficult to obtain accurate inter prediction for the local video regions with inconsistent motion and occlusion. Traditional video coding standards propose various technologies to handle motion inconsistency and occlusion, such as recursive partitions, geometric partitions, and long-term references. However, existing learned video compression schemes focus on obtaining an overall minimized prediction error averaged over all regions while ignoring the motion inconsistency and occlusion in local regions. In this paper, we propose a spatial decomposition and temporal fusion based inter prediction for learned video compression. To handle motion inconsistency, we propose to decompose the video into structure and detail (SDD) components first. Then we perform SDD-based motion estimation and SDD-based temporal context mining for the structure and detail components to generate short-term temporal contexts. To handle occlusion, we propose to propagate long-term temporal contexts by recurrently accumulating the temporal information of each historical reference feature and fuse them with short-term temporal contexts. With the SDD-based motion model and long short-term temporal contexts fusion, our proposed learned video codec can obtain more accurate inter prediction. Comprehensive experimental results demonstrate that our codec outperforms the reference software of H.266/VVC on all common test datasets for both PSNR and MS-SSIM.

Paper Structure (36 sections, 9 equations, 11 figures, 4 tables)

Introduction
Related Work
Learned Image Compression
Learned Video Compression
Overview
SDD-based Motion Estimation
SDD-based MV Encoder-Decoder
SDD-based Temporal Context Mining
Long-Term Temporal Contexts Generator
Long Short-Term Temporal Contexts Fusion
Contextual Encoder-Decoder and Frame Generator
Entropy Model
Methodology
SDD-based Motion Modeling
Structure and Detail Decomposition
...and 21 more sections

Figures (11)

Figure 1: Overview of our proposed learned video compression scheme: 1) the motion vectors $v_t^{s}$ of structure components and the motion vectors $v_t^{d}$ of detail components are estimated independently but encoded jointly into a quantized latent representation $[m_t]$; 2) $v_t^{s}$ and $v_t^{d}$ are used to warp the structure and detail components of $\hat{F}_{t-1}$ to generate short-term temporal contexts $\bar{C}_t^{0}, \bar{C}_t^{1}, \bar{C}_t^{2}$; 3) a long-term temporal context is generated by recurrently accumulating the temporal information of historical reference features and fused with the short-term temporal contexts to generate the final temporal contexts $C_t^0, C_t^1, C_t^2$; 4) the current frame $x_t$ is encoded into the quantized latent $[y_t]$ and decoded to $\hat{x}_{t}$ with the help of the learned temporal contexts. "AE" and "AD" represent arithmetic encoder and arithmetic decoder. "[ ]" represents the quantization operator.
Figure 2: Illustration of structure and detail decomposition (SDD). The left column shows the current frame and the reference frame. The middle column shows their structure components. A pair of bilinear down-sampling (Down) and up-sampling (Up) operations is used to extract the structure components. The right column shows the detail components, which are the differences between the original frames and the corresponding structure components. For better visualization, we subtract the detail components from 255.
Figure 3: Illustration of structure and detail decomposition (SDD)-based motion estimation and compression. Both the current frame $x_t$ and the reference frame $\hat{x}_{t-1}$ are first decomposed into structure and detail components. For better visualization, we subtract the detail components from 255. Two motion estimation networks are used: one estimates the MV $v_t^s$ of the structure components ($x_t^s$, $\hat{x}_{t-1}^s$), which contains the consistent motion, and the other estimates the MV $v_t^d$ of the detail components ($x_t^d$, $\hat{x}_{t-1}^d$), which contains the additional inconsistent motion differences. Then $v_t^s$ and $v_t^d$ are encoded and decoded jointly.
Figure 4: Illustration of SDD-based temporal context mining module.
Figure 5: Illustration of long-term temporal context generation. The long-term temporal contexts are generated by recurrently accumulating the temporal information of each historical reference feature.
...and 6 more figures
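
The decomposition described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `bilinear_resize` and `decompose_structure_detail`, the down-sampling factor of 2, the half-pixel sampling convention, and the single-channel input are all assumptions made for the example. Only the overall recipe (bilinear down-sampling followed by up-sampling gives the structure component, and the detail component is the frame minus the structure) comes from the caption.

```python
import numpy as np


def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (half-pixel centers).

    Hypothetical helper for this sketch; the sampling convention is an assumption.
    """
    in_h, in_w = img.shape
    # Map each output pixel center back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def decompose_structure_detail(frame, factor=2):
    """Split a single-channel frame into structure and detail components.

    structure = Up(Down(frame)), detail = frame - structure.
    The factor of 2 is an assumed value, not taken from the paper.
    """
    frame = np.asarray(frame, dtype=np.float64)
    h, w = frame.shape
    small = bilinear_resize(frame, h // factor, w // factor)  # Down
    structure = bilinear_resize(small, h, w)                  # Up
    detail = frame - structure
    return structure, detail
```

By construction, `structure + detail` reproduces the input frame, so the decomposition itself loses no information. The visualization used in Figures 2 and 3 corresponds to displaying `255 - detail` rather than `detail`.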
