Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality

Mohamed Amine Kerkouri, Marouane Tliba, Aladine Chetouani, Nour Aburaed, Alessandro Bruno

TL;DR

This paper argues that Mean Opinion Score (MOS) is inadequate as the sole supervisory signal for multimedia quality assessment, as it hides semantic failures, user intent, and justification. It proposes a paradigm shift to context-aware, reasoning-centric, and multimodal quality assessment, detailing how these pillars address MOS deficiencies and outlining a roadmap for richer benchmarks, data collection, and evaluation metrics. The contributions include a concrete design for context-conditioned modeling, structured reasoning outputs, and multimodal alignment, along with benchmark reforms that incorporate persona-conditioned judgments and rationale annotations. The work highlights practical impact for high-stakes domains by enabling interpretable, task-specific, and human-aligned quality judgments, ultimately moving toward trustworthy and robust quality assessment systems.

Abstract

This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
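The claim that MOS reduces judgments to a single scalar can be made concrete with a small sketch. The ratings below are hypothetical and not taken from the paper; the sketch only assumes that MOS is computed as the arithmetic mean of per-rater scores on a 1-5 scale.

```python
from statistics import mean, pstdev


def mos(ratings):
    """Mean Opinion Score: arithmetic mean of per-rater scores."""
    return mean(ratings)


# Hypothetical 1-5 ratings of the same image from two rater panels.
# "consensus": every rater finds the image mediocre.
# "split": half the raters find it unusable, half find it excellent
# (for example, because they judged it for different tasks).
consensus = [3, 3, 3, 3, 3, 3]
split = [1, 1, 1, 5, 5, 5]

for name, ratings in (("consensus", consensus), ("split", split)):
    # Same mean for both panels; only the spread tells them apart.
    print(name, "MOS:", mos(ratings), "std:", pstdev(ratings))
```

Both panels yield a MOS of 3, although the second one reflects strong disagreement that the scalar cannot express. Reporting the spread helps, but it still does not say why raters disagreed, which is the gap the context and rationale annotations are meant to fill.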

Paper Structure

This paper contains 20 sections, 1 figure.

Figures (1)

  • Figure 1: An illustrative example comparing a traditional MOS quality assessment system (a) with a reduced pipeline from our paradigm (b) in a clinical setting. Given the context, the model can focus on estimating quality for a specific task and give recommendations, while the reasoning behind the decision and recommendation stays transparent through the chain-of-thought (CoT).