A Mamba-based Perceptual Loss Function for Learning-based UGC Transcoding

Zihao Qi, Chen Feng, Fan Zhang, Xiaozhong Xu, Shan Liu, David Bull

Abstract

In user-generated content (UGC) transcoding, source videos typically suffer from various degradations caused by prior compression, editing, or suboptimal capture conditions. Consequently, existing video compression paradigms that solely optimize for fidelity relative to the reference become suboptimal, as they force the codec to replicate the inherent artifacts of the non-pristine source. To address this, we propose a novel perceptually inspired loss function for learning-based UGC video transcoding that redefines the role of the reference video, shifting it from a ground-truth pixel anchor to an informative contextual guide. Specifically, we train a lightweight neural quality model based on a Selective Structured State-Space Model (Mamba), optimized using a weakly-supervised Siamese ranking strategy. The proposed model is then integrated into the rate-distortion optimization (RDO) process of two neural video codecs (DCVC and HiNeRV) as a loss function, aiming to generate reconstructed content with improved perceptual quality. Our experiments demonstrate that this framework achieves substantial coding gains over both autoencoder and implicit neural representation-based baselines, with 8.46% and 12.89% BD-rate savings, respectively.
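To make the training recipe concrete, below is a minimal, hedged sketch of a weakly-supervised Siamese ranking objective of the kind the abstract describes. Everything here is an illustrative assumption rather than the authors' implementation: `PTLossNet` is a tiny convolutional stand-in for the paper's Mamba-based quality model, and the pair labels are presumed to come from a weak ordering (e.g. known compression levels) rather than human ratings.

```python
# Hedged sketch of a weakly-supervised Siamese ranking objective.
# All names (PTLossNet, the pair construction, the margin) are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class PTLossNet(nn.Module):
    """Stand-in for the lightweight quality model.

    The paper uses a Selective Structured State-Space (Mamba) network;
    a tiny convolutional head is substituted here so the sketch runs
    end to end.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # scalar quality score per frame

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def siamese_ranking_loss(model, frame_a, frame_b, rank_label, margin=0.5):
    """Pairwise margin-ranking loss over a Siamese pair.

    rank_label is +1 where frame_a is (weakly) known to be higher
    quality than frame_b, and -1 otherwise -- e.g. inferred from the
    encoding ladder rather than from human labels (weak supervision).
    Both frames pass through the same shared-weight model.
    """
    score_a = model(frame_a)
    score_b = model(frame_b)
    return nn.functional.margin_ranking_loss(
        score_a, score_b, rank_label, margin=margin)

# Toy usage: two batches of 64x64 RGB crops, with frame_a assumed
# higher quality than frame_b in every pair.
model = PTLossNet()
a, b = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
label = torch.ones(4, 1)
loss = siamese_ranking_loss(model, a, b, label)
loss.backward()
```

A margin ranking loss only needs relative quality orderings, which is what makes the weak supervision viable: the model never requires absolute quality scores, only pairs for which one side is known to be better.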

Paper Structure

This paper contains 14 sections, 2 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Illustration of the UGC video delivery pipeline. The $\mathbf{S}$ource content captured by a user is first compressed on the user device for storage, producing a non-pristine $\mathbf{R}$eference. This reference is then uploaded to UGC streaming platforms and further compressed into a $\mathbf{D}$istorted video before being transmitted to the viewer.
  • Figure 2: (A) Illustration of the PT-Loss integration into both DCVC and HiNeRV codecs. (B) The architecture of the PT-Loss network. (See the loss sketch after this list.)
  • Figure 3: Overall and group-wise compression results on the BVI-UGC dataset across different reference groups.
  • Figure 4: Visual comparison between DCVC and DCVC+PT-Loss.
  • Figure 5: Visual comparison between HiNeRV and HiNeRV+PT-Loss.
  • ...and 1 more figure
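As a companion to Figure 2, here is a hedged sketch of how a frozen learned quality model can supply the distortion term in a neural codec's rate-distortion objective. The `codec` and `pt_loss_net` interfaces, the λ weighting, and the toy stand-ins are all assumptions for illustration, not DCVC's or HiNeRV's actual APIs.

```python
# Hedged sketch of an RDO objective whose distortion term comes from a
# frozen perceptual network instead of a pixel-wise metric. The interfaces
# below are assumptions for illustration only.
import torch

def rd_loss(codec, pt_loss_net, source, lmbda=0.01):
    """Rate-distortion loss with a learned perceptual distortion term.

    Per the abstract, the non-pristine UGC source is not treated as a
    pixel-level ground truth; instead, the reconstruction is scored by
    the frozen quality network, with the source passed in only as
    contextual guidance.
    """
    recon, rate_bpp = codec(source)                 # reconstruction + bits/pixel
    distortion = pt_loss_net(recon, source).mean()  # perceptual distance
    return rate_bpp + lmbda * distortion

# Toy stand-ins so the sketch executes: a noisy identity "codec" with a
# fixed fake rate, and an L1 proxy in place of the frozen quality network.
toy_codec = lambda x: (x + 0.01 * torch.randn_like(x), torch.tensor(0.1))
toy_pt = lambda recon, ref: (recon - ref).abs().mean(dim=(1, 2, 3))
frames = torch.rand(2, 3, 64, 64)
print(rd_loss(toy_codec, toy_pt, frames).item())
```

Weighting conventions differ across codecs (some scale the rate term by λ instead of the distortion); the essential point is only that the distortion term is produced by the learned quality model rather than by pixel fidelity against the non-pristine reference.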