VQ-Jarvis: Retrieval-Augmented Video Restoration Agent with Sharp Vision and Fast Thought

Xuanyu Zhang, Weiqi Li, Qunliang Xing, Jingfen Xie, Bin Chen, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao

Abstract

Video restoration in real-world scenarios is challenged by heterogeneous degradations, where static architectures and fixed inference pipelines often fail to generalize. Recent agent-based approaches offer dynamic decision making, yet existing video restoration agents remain limited by insufficient quality perception and inefficient search strategies. We propose VQ-Jarvis, a retrieval-augmented, all-in-one intelligent video restoration agent with sharper vision and faster thought. VQ-Jarvis is designed to accurately perceive degradations and subtle differences among paired restoration results, while efficiently discovering optimal restoration trajectories. To enable sharp vision, we construct VSR-Compare, the first large-scale video paired enhancement dataset with 20K comparison pairs covering 7 degradation types, 11 enhancement operators, and diverse content domains. Based on this dataset, we train a multiple operator judge model and a degradation perception model to guide agent decisions. To achieve fast thought, we introduce a hierarchical operator scheduling strategy that adapts to video difficulty: for easy cases, optimal restoration trajectories are retrieved in a one-step manner from a retrieval-augmented generation (RAG) library; for harder cases, a step-by-step greedy search is performed to balance efficiency and accuracy. Extensive experiments demonstrate that VQ-Jarvis consistently outperforms existing methods on complex degraded videos.
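
The hierarchical scheduling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `LibraryEntry`, `retrieve_trajectory`, `greedy_search`, `schedule`, and `easy_threshold` are hypothetical, the rule that a higher predicted quality score means an easier case is an assumption, and the similarity measure used for retrieval is a simple stand-in. Videos are represented as lists of floats so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Video = List[float]                      # toy stand-in for decoded frames
Operator = Callable[[Video], Video]
Judge = Callable[[Video, Video], bool]   # True if the second result is preferred


@dataclass(frozen=True)
class LibraryEntry:
    """One stored experience: what the input looked like and what worked."""
    quality_score: float
    degradations: frozenset
    trajectory: Tuple[str, ...]          # ordered operator names


def retrieve_trajectory(score: float, degradations: Sequence[str],
                        library: Sequence[LibraryEntry]) -> List[str]:
    """One-step path: reuse the trajectory of the most similar stored case."""
    wanted = frozenset(degradations)

    def distance(entry: LibraryEntry) -> Tuple[int, float]:
        # Assumed similarity: fewest mismatched labels, then closest score.
        return (len(entry.degradations ^ wanted),
                abs(entry.quality_score - score))

    return list(min(library, key=distance).trajectory)


def greedy_search(video: Video, degradations: Sequence[str],
                  operators: Dict[str, Dict[str, Operator]],
                  judge: Judge) -> Tuple[List[str], Video]:
    """Step-wise path: per sub-task, try each candidate, keep the judged winner."""
    trajectory: List[str] = []
    current = video
    for task in degradations:
        best_name, best_out = None, None
        for name, op in operators.get(task, {}).items():
            out = op(current)
            if best_out is None or judge(best_out, out):
                best_name, best_out = name, out
        if best_name is not None:        # skip sub-tasks with no candidates
            trajectory.append(best_name)
            current = best_out
    return trajectory, current


def schedule(video: Video,
             perceive: Callable[[Video], Tuple[float, List[str]]],
             library: Sequence[LibraryEntry],
             operators: Dict[str, Dict[str, Operator]],
             judge: Judge,
             easy_threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Route easy videos to retrieval and hard ones to greedy search."""
    score, degradations = perceive(video)
    # Assumption: a higher predicted quality score marks an easier case.
    if library and score >= easy_threshold:
        return "retrieval", retrieve_trajectory(score, degradations, library)
    trajectory, _ = greedy_search(video, degradations, operators, judge)
    return "greedy", trajectory
```

In this sketch the retrieval path costs a single library lookup, while the greedy path runs every candidate operator once per sub-task and needs one pairwise judgment for each candidate after the first. That is the efficiency and accuracy trade-off the abstract refers to.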

Paper Structure (40 sections, 12 equations, 10 figures, 10 tables)

This paper contains 40 sections, 12 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Overview of the challenges in existing enhancement comparison and the construction of VSR-Compare. (a) Existing quality understanding models (e.g., Q-Insight) fail to reliably capture subtle differences between enhancement results (Acc. < 60%). (b) Data domain and degradation distributions of our VSR-Compare. (c) Training sample from VSR-Compare.
  • Figure 2: Overview of the VSR-Compare pipeline, including degradation-aware pair construction, human-machine collaborative annotation, and the data distribution of VSR-Compare. The outputs of multiple MLLMs are first filtered to remove inconsistent selections, and their reasoning results are fused; together with additional human expert annotation and verification, they jointly form the comparison pairs.
  • Figure 3: Overview of the proposed VQ-Jarvis framework. Given a degraded input video, a degradation perception model first estimates the video quality score and degradation attributes. Based on the predicted score, the agent adaptively selects between two restoration strategies: one-step retrieval, which retrieves an optimal restoration trajectory from a quality-aligned RAG library using prior reconstruction experience, and step-wise greedy search, which sequentially applies and compares multiple restoration operators across sub-tasks.
  • Figure 4: Qualitative comparison between our VQ-Jarvis and other methods. The time profile of each video is shown below.
  • Figure 5: Visualized results of our VQ-Jarvis and other competing methods on multi-degradation video restoration.
  • ...and 5 more figures