MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

TL;DR

MoA-VR is a modular, mixture-of-agents framework for all-in-one video restoration that handles complex, mixed degradations through three coordinated agents: degradation identification, routing and restoration, and quality assessment. By pairing a vision-language degradation identifier and its MoA-VD degradation benchmark with a quality assessment model built on the Res-VQ restored-video dataset, it enables adaptive restoration pipelines guided by LLMs and VLMs in a closed loop. The system outperforms state-of-the-art all-in-one video restoration methods on both objective metrics (PSNR/SSIM) and perceptual quality metrics (MANIQA, CLIP-IQA, MUSIQ), and generalizes well to unseen degradations and real-world videos. The work highlights the potential of combining multimodal perception, modular reasoning, and perceptual feedback for robust, scalable video restoration in real-world scenarios.

Abstract

Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.

Paper Structure

This paper contains 17 sections, 15 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of the agents in MoA-VR. MoA-VR restores low-quality video clips with complex degradations through the collaboration of three agents: the degradation identification agent, the routing and restoration agent, and the quality assessment agent.
  • Figure 2: MoA-VR incorporates three specialized agents within a closed-loop architecture. For a low-quality input video, $\mathcal{A}_i$ identifies the degradation types and levels; $\mathcal{A}_r$ generates a degradation removal plan and invokes the corresponding restoration tools; $\mathcal{A}_a$ assesses all intermediate results and selects the highest-quality one. $\mathcal{A}_i$ then checks whether the previous restoration step succeeded. If it failed, $\mathcal{A}_r$ rolls back and reroutes; if it succeeded, $\mathcal{A}_r$ continues with the previous plan. This loop repeats until all degradations are removed.
  • Figure 3: Feature distribution of (a) MoA-VD-GT and (b) MoA-VD-LQ. SI and TI indicate spatial and temporal information, respectively.
  • Figure 4: Visual examples of different video degradations.
  • Figure 5: The overall framework of $\mathcal{A}_i$. $\mathcal{A}_i$ evaluates the levels of all degradation types within a single all-in-one framework. It takes a video together with a text prompt and identifies the degradations present. A vision encoder extracts spatial and temporal features, and a text tokenizer tokenizes the input prompt. Trained projectors map these features into a shared space, and a pre-trained LLM, fine-tuned with LoRA, fuses them.
  • ...and 9 more figures
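
The closed loop described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function `closed_loop_restore`, the callables `identify`, `route`, and `assess`, the `toolbox` layout, and the toy representation of a video as a set of degradation names are all hypothetical. Clearing the record of failed attempts after each successful step is also an assumption made here so that a degradation can be retried once the video has changed.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Toy stand-in (assumption): a "video" is just the set of degradations it still has.
Video = FrozenSet[str]
Tool = Callable[[Video], Video]


def closed_loop_restore(
    video: Video,
    identify: Callable[[Video], Set[str]],   # stands in for the identification agent
    route: Callable[[Set[str]], str],        # stands in for the routing agent
    toolbox: Dict[str, Dict[str, Tool]],     # degradation -> {tool name: tool}
    assess: Callable[[Video], float],        # stands in for the quality assessment agent
    max_rounds: int = 8,
) -> Tuple[Video, List[Tuple[str, str]]]:
    current = video
    failed: Set[Tuple[str, str]] = set()     # (degradation, tool) pairs that failed on `current`
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        degradations = identify(current)
        if not degradations:
            break  # nothing left to remove
        # Only consider degradations that still have an untried tool.
        viable = {
            d for d in degradations
            if any((d, name) not in failed for name in toolbox.get(d, {}))
        }
        if not viable:
            break  # no remaining option, stop instead of looping forever
        target = route(viable)
        # Run every untried tool for the target and keep the best-scoring result.
        candidates = {
            name: tool(current)
            for name, tool in toolbox[target].items()
            if (target, name) not in failed
        }
        best = max(candidates, key=lambda name: assess(candidates[name]))
        restored = candidates[best]
        if target in identify(restored):
            failed.add((target, best))  # roll back: keep `current`, reroute next round
            continue
        current = restored              # success: accept the result
        failed.clear()                  # assumption: the video changed, so retries are allowed
        history.append((target, best))
    return current, history


# Toy demo: the denoiser only works once blur has been removed,
# so the first attempt fails and the loop reroutes to deblurring.
priority = ["noise", "blur"]
demo_toolbox: Dict[str, Dict[str, Tool]] = {
    "noise": {"denoiser": lambda v: v - {"noise"} if "blur" not in v else v},
    "blur": {"deblur": lambda v: v - {"blur"}},
}
result, steps = closed_loop_restore(
    frozenset({"noise", "blur"}),
    identify=lambda v: set(v),
    route=lambda viable: min(viable, key=priority.index),
    toolbox=demo_toolbox,
    assess=lambda v: -float(len(v)),
)
print(result, steps)
```

In the demo, the router first tries to remove noise, the check by `identify` reports that the noise is still present, and the loop keeps the unmodified video and reroutes to deblurring before retrying the denoiser. In a real system each callable would wrap a model call, and the quality score would come from a learned assessor rather than a count of remaining degradations.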