Towards Blind Bitstream-corrupted Video Recovery via a Visual Foundation Model-driven Framework

Tianyi Liu, Kejun Wu, Chen Cai, Yi Wang, Kim-Hui Yap, Lap-Pui Chau

TL;DR

This work tackles the problem of recovering high-quality video content from bitstream-corrupted inputs without frame-by-frame mask annotations. It introduces a visual foundation model–driven framework comprising a Detect Any Corruption (DAC) module for corruption localization and a Corruption-aware Feature Completion (CFC) module that uses a Mixture-of-Residual-Experts to refine features based on high-level corruption understanding. DAC leverages bitstream priors and cross-domain prompting to produce robust corruption masks and embeddings, while CFC enhances intermediate representations with multi-scale embeddings and adaptive expert fusion, enabling effective blind recovery. Across the BSCV dataset, the method achieves state-of-the-art results in blind and non-blind settings, demonstrating practical potential for robust multimedia communication and downstream perception tasks such as object detection and captioning.

Abstract

Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.
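
To make the mixture-of-residual-experts idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the gating function or the expert design, so the softmax gate, the two toy experts (`suppress` and `enhance`), the `gate_matrix`, and every function name here are assumptions chosen for illustration. It shows one plausible reading: a corruption embedding produces per-expert weights, and the weighted expert outputs are added back to the input feature as a residual.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(embedding: Sequence[float],
                 gate_matrix: Sequence[Sequence[float]]) -> List[float]:
    """One score per expert (dot product of a gate row with the embedding)."""
    scores = [sum(w * e for w, e in zip(row, embedding)) for row in gate_matrix]
    return softmax(scores)


def mixture_of_residual_experts(feature: Vector,
                                embedding: Sequence[float],
                                experts: Sequence[Callable[[Vector], Vector]],
                                gate_matrix: Sequence[Sequence[float]]) -> Vector:
    """Return feature + sum_k g_k * expert_k(feature), with g from the gate."""
    weights = gate_weights(embedding, gate_matrix)
    residuals = [expert(feature) for expert in experts]
    return [
        x + sum(w * r[i] for w, r in zip(weights, residuals))
        for i, x in enumerate(feature)
    ]


# Hypothetical experts: one damps the feature, one amplifies it.
def suppress(feature: Vector) -> Vector:
    return [-0.5 * x for x in feature]


def enhance(feature: Vector) -> Vector:
    return [0.5 * x for x in feature]


feature = [1.0, -2.0, 4.0]
experts = [suppress, enhance]
gate_matrix = [[1.0, 0.0], [0.0, 1.0]]  # assumed: one gate row per expert

# A neutral embedding weights both experts equally, so the residuals cancel.
balanced = mixture_of_residual_experts(feature, [0.0, 0.0], experts, gate_matrix)

# An embedding that strongly activates the first gate row favors suppression.
damped = mixture_of_residual_experts(feature, [10.0, 0.0], experts, gate_matrix)
```

In this toy setup the gate decides whether residual information is attenuated or reinforced, which mirrors the stated goal of suppressing artifacts while enhancing informative residuals. A real model would use learned experts and a learned gate over multi-scale features.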

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) The bitstream-corrupted video recovery framework requires manually labeled masks to identify the corrupted region, which incurs significant annotation cost. (b) Blind bitstream-corrupted video recovery eliminates the need for mask annotation by localizing the corrupted region automatically and performing enhanced recovery.
  • Figure 2: Architecture of the Detect Any Corruption (DAC) module.
  • Figure 3: Architecture of the proposed blind bitstream-corrupted video recovery framework and detailed design of the Corruption-aware Feature Completion (CFC) module. During CFC training, the trained DAC module provides corruption indications and foundational corruption embeddings.
  • Figure 4: Visual comparison results under the blind recovery setting. We visualize the masked regions indicated by SAM2.1 and DAC to demonstrate both corruption detection and video recovery performance within the target areas.
  • Figure 5: Recovering bitstream-corrupted video improves the resilience of multimedia systems in noisy environments, mitigating the risk of perceptual errors.