MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
TL;DR
MoA-VR is a modular, mixture-of-agents framework for all-in-one video restoration that handles complex, mixed degradations through three coordinated agents: degradation identification, routing and restoration, and quality assessment. Using a vision-language degradation identifier (MoA-VD) and a restored-video quality dataset (Res-VQ) built for restoration-specific VQA, it performs adaptive, pipeline-driven restoration guided by LLMs and VLMs in a closed loop. The system outperforms state-of-the-art all-in-one VR methods on both objective metrics (PSNR/SSIM) and perceptual quality metrics (MANIQA, CLIP-IQA, MUSIQ), and generalizes well to unseen degradations and real-world videos. The work highlights the potential of combining multimodal perception, modular reasoning, and perceptual feedback for robust, scalable video restoration in real-world scenarios.
Abstract
Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
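
The coordination among the three agents can be illustrated with a toy control loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: the `Video` class, the `identify`, `route`, `assess`, and `restore` functions, the tool names, and the scoring rule are all hypothetical stand-ins for the VLM identifier, the LLM router, and the VQA model described above. It shows one plausible reading of the closed loop: identify degradations, pick a tool, apply it, and keep the result only if the assessed quality improves.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Video:
    # Toy stand-in for a clip: it only tracks which degradations remain.
    degradations: Set[str]
    applied: List[str] = field(default_factory=list)


def identify(video: Video) -> Set[str]:
    # Hypothetical degradation-identification agent (a VLM in the paper).
    return set(video.degradations)


def make_tool(name: str, target: str) -> Callable[[Video], Video]:
    # Hypothetical restoration tool that removes exactly one degradation.
    def tool(video: Video) -> Video:
        return Video(video.degradations - {target}, video.applied + [name])
    return tool


TOOLS: Dict[str, Callable[[Video], Video]] = {
    "noise": make_tool("denoiser", "noise"),
    "compression": make_tool("deblocker", "compression"),
    "low_light": make_tool("enhancer", "low_light"),
}


def route(found: Set[str], rejected: Set[str]) -> Optional[str]:
    # Hypothetical router: a fixed alphabetical rule stands in for the
    # LLM-driven strategy; it skips choices that were already rejected.
    for degradation in sorted(found):
        if degradation in TOOLS and degradation not in rejected:
            return degradation
    return None


def assess(video: Video) -> float:
    # Hypothetical quality agent: fewer remaining degradations, higher score.
    return 1.0 / (1.0 + len(video.degradations))


def restore(video: Video, max_steps: int = 5) -> Video:
    rejected: Set[str] = set()
    score = assess(video)
    for _ in range(max_steps):
        choice = route(identify(video), rejected)
        if choice is None:
            break  # nothing left that the available tools can address
        candidate = TOOLS[choice](video)
        new_score = assess(candidate)
        if new_score > score:
            video, score = candidate, new_score  # accept the step
        else:
            rejected.add(choice)  # roll back and try another tool
    return video


result = restore(Video({"noise", "low_light"}))
print(result.applied, result.degradations)
```

In this sketch the quality score serves as the feedback signal that accepts or rejects each restoration step; a real system would replace every stub with a learned model.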
