Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Naveen Ravipati, Binbin Li, Hansi Wu, Zhengzhong Tu

TL;DR

Q-Router introduces a vision-language model–driven routing framework that orchestrates a pool of specialized VQA experts to achieve universal video quality assessment. A three-tier pipeline balances efficiency, accuracy, and interpretability, with probabilistic frame extraction and spatiotemporal artifact localization providing actionable evidence. Across UGC, AIGC, and Q-Bench-Video benchmarks, Q-Router delivers state-of-the-art performance and robust generalization, supported by artifact heatmaps that aid debugging and post-processing. The work demonstrates the potential of expert routing for scalable, interpretable multimodal evaluation and points to future extensions in restoration and broader VQA tasks.

Abstract

Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC) and short-form videos to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason about the input video's semantics and then ensemble the most appropriate experts. We build a multi-tiered routing system based on the computing budget, with the heaviest tier adding dedicated spatiotemporal artifact localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, which suggests its potential as a reward function for post-training video generation models.
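The routing-then-ensembling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `route_and_ensemble`, the expert names, the fixed scores, and the weighted-average ensembling rule are all assumptions made for illustration. In the actual system a VLM would produce the expert selection. Here the selection is passed in as a plain dictionary of weights.

```python
from typing import Callable, Dict

# Hypothetical stand-in for an expert VQA model: maps a video path to a quality score.
Expert = Callable[[str], float]


def route_and_ensemble(
    video: str,
    experts: Dict[str, Expert],
    router_weights: Dict[str, float],
) -> float:
    """Combine the scores of router-selected experts by weighted average.

    `router_weights` plays the role of the router's output (assumed format):
    expert name -> non-negative weight, where 0 means "not selected".
    """
    selected = {
        name: weight
        for name, weight in router_weights.items()
        if weight > 0 and name in experts
    }
    if not selected:
        raise ValueError("router selected no known expert")
    total = sum(selected.values())
    return sum(w * experts[name](video) for name, w in selected.items()) / total


# Toy experts returning fixed scores (illustrative values only, not real model outputs).
experts: Dict[str, Expert] = {
    "ugc_expert": lambda video: 0.8,
    "aigc_expert": lambda video: 0.4,
}

# Pretend the router judged the clip to be mostly user-generated content.
weights = {"ugc_expert": 3.0, "aigc_expert": 1.0}
score = route_and_ensemble("clip.mp4", experts, weights)
print(round(score, 3))
```

Keeping the router's decision separate from the expert pool means new experts can be registered in the dictionary without changing the ensembling code, which mirrors the extensibility goal stated above.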

Paper Structure

This paper contains 29 sections, 10 figures, 3 tables, 4 algorithms.

Figures (10)

  • Figure 1: We present Q-Router, an agentic framework designed for diverse video quality assessment tasks. Q-Router leverages a VLM as the router to dynamically assign the most suitable expert model from a comprehensive pool of state-of-the-art VQA methods. The expert pool includes COVER, DOVER, BVQA, UVQA, MaxVQA, and T2VQA, enabling robust and adaptive evaluation across user-generated, AI-generated, and computer-generated video content.
  • Figure 2: Prompt for VQA with Q-Router (Tier 1) using GPT-4o.
  • Figure 3: Distorted frames and their corresponding artifact localization heatmaps across UGC, AIGC, and CG videos. The first row shows example distorted frames, while the second row highlights suspicious regions detected by Q-Router.
  • Figure 4: Impact of prompting structure across both UGC and AIGC benchmarks for VQA, with GPT-4o as backbone.
  • Figure 5: Prompt for visual question answering with Q-Router (Tier 0) using GPT-4o.
  • ...and 5 more figures