Attentive AV-FusionNet: Audio-Visual Quality Prediction with Hybrid Attention

Ina Salaj, Arijit Biswas

TL;DR

This paper addresses perceptual audio-visual quality (AVQ) prediction under bandwidth constraints. It introduces Attentive AV-FusionNet, a full-reference AVQ model that fuses GML-derived audio features with VMAF video features through bidirectional cross-attention followed by self-attention. A modality relevance estimator outputs the relative importance of audio versus video for each piece of content, which could inform bitrate adaptation. The authors report state-of-the-art AVQ prediction accuracy on both internal and external datasets, strong cross-dataset generalization, and robustness across diverse content types. The work lays groundwork for adaptive streaming systems and motivates future work on real-time deployment and explainable attention analyses.

Abstract

We introduce a novel deep learning-based audio-visual quality (AVQ) prediction model that leverages internal features from state-of-the-art unimodal predictors. Unlike prior approaches that rely on simple fusion strategies, our model employs a hybrid representation that combines learned Generative Machine Listener (GML) audio features with hand-crafted Video Multimethod Assessment Fusion (VMAF) video features. Attention mechanisms capture cross-modal interactions and intra-modal relationships, yielding context-aware quality representations. A modality relevance estimator quantifies each modality's contribution per content, potentially enabling adaptive bitrate allocation. Experiments demonstrate improved AVQ prediction accuracy and robustness across diverse content types.
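
The fusion pattern described above (cross-attention in both directions, then self-attention, then a per-content modality relevance estimate) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimensions, residual connections, mean pooling, random weights, and the names `attention`, `fuse`, `w_rel`, and `w_out` are all assumptions made for illustration, and the random inputs stand in for GML and VMAF features that have already been projected to a common width.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: query (Tq, d), key/value (Tk, d) -> (Tq, d).
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def fuse(audio, video, rng):
    # Hypothetical fusion: bidirectional cross-attention, then self-attention.
    d = audio.shape[-1]
    # Audio tokens attend to video tokens and vice versa (residuals assumed).
    audio_ctx = audio + attention(audio, video, video)
    video_ctx = video + attention(video, audio, audio)
    # Self-attention over the concatenated audio and video tokens.
    joint = np.concatenate([audio_ctx, video_ctx], axis=0)
    joint = joint + attention(joint, joint, joint)
    # Mean-pool each modality's tokens into one vector (assumed pooling).
    audio_vec = joint[: len(audio)].mean(axis=0)
    video_vec = joint[len(audio):].mean(axis=0)
    # Hypothetical relevance head: one logit per modality, softmax to sum to 1.
    w_rel = rng.standard_normal(d) / np.sqrt(d)
    relevance = softmax(np.array([audio_vec @ w_rel, video_vec @ w_rel]))
    # Hypothetical linear quality head on the relevance-weighted vector.
    w_out = rng.standard_normal(d) / np.sqrt(d)
    fused = relevance[0] * audio_vec + relevance[1] * video_vec
    return float(fused @ w_out), relevance

rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))  # 5 audio tokens, width 8 (placeholder)
video = rng.standard_normal((7, 8))  # 7 video tokens, width 8 (placeholder)
score, relevance = fuse(audio, video, rng)
print("quality score:", score)
print("relevance [audio, video]:", relevance)
```

In this sketch the two relevance weights sum to 1, so they can be read as the share of the prediction attributed to audio versus video for one clip. Using those weights to combine the pooled vectors is a design choice of the sketch, not something the abstract specifies.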

Paper Structure

This paper contains 7 sections, 9 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Attentive AV-FusionNet: joint audio-visual quality prediction model integrating GML and VMAF features, with 7.4M trainable parameters in the projection and fusion network.
  • Figure 2: Modality importance estimation for different content types.