Attentive AV-FusionNet: Audio-Visual Quality Prediction with Hybrid Attention
Ina Salaj, Arijit Biswas
TL;DR
This paper addresses perceptual audio-visual quality (AVQ) prediction under bandwidth constraints by introducing Attentive AV-FusionNet, a full-reference AVQ model that fuses GML-derived audio features with VMAF video features through bidirectional cross-attention followed by self-attention. A modality relevance estimator produces a per-content estimate of the relative importance of audio versus video, which could inform bitrate adaptation. The authors report state-of-the-art AVQ prediction accuracy on both internal and external datasets, with strong cross-dataset generalization and robustness across diverse content types. The work lays groundwork for adaptive streaming systems and motivates future work on real-time deployment and explainable attention analyses.
Abstract
We introduce a novel deep learning-based audio-visual quality (AVQ) prediction model that leverages internal features from state-of-the-art unimodal predictors. Unlike prior approaches that rely on simple fusion strategies, our model employs a hybrid representation that combines learned Generative Machine Listener (GML) audio features with hand-crafted Video Multimethod Assessment Fusion (VMAF) video features. Attention mechanisms capture cross-modal interactions and intra-modal relationships, yielding context-aware quality representations. A modality relevance estimator quantifies each modality's contribution per content, potentially enabling adaptive bitrate allocation. Experiments demonstrate improved AVQ prediction accuracy and robustness across diverse content types.
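To make the fusion idea concrete, the following is a minimal sketch of bidirectional cross-attention followed by self-attention, with a toy modality relevance score. It is not the authors' implementation. It assumes both modalities are already available as short sequences of feature vectors with a shared dimension, it uses identity projections in place of learned weights, and the relevance score (a softmax over mean absolute activation per modality) is a hypothetical stand-in for the estimator described in the abstract. The names `attend`, `fuse`, and `modality_relevance` are illustrative only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention without learned projections (assumption):
    # each query row becomes a weighted average of the value rows.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def fuse(audio, video):
    # Bidirectional cross-attention: audio queries attend to video tokens,
    # and video queries attend to audio tokens.
    audio_ctx = attend(audio, video, video)
    video_ctx = attend(video, audio, audio)
    # Self-attention over the concatenated sequence of both modalities.
    tokens = audio_ctx + video_ctx
    return attend(tokens, tokens, tokens)

def modality_relevance(fused, n_audio):
    # Hypothetical relevance score: softmax over the mean absolute
    # activation of the audio-side and video-side tokens.
    def mean_abs(rows):
        return sum(abs(x) for r in rows for x in r) / (len(rows) * len(rows[0]))
    w_audio, w_video = softmax([mean_abs(fused[:n_audio]),
                                mean_abs(fused[n_audio:])])
    return w_audio, w_video

# Toy inputs: 2 audio tokens and 3 video tokens, each of dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = fuse(audio, video)
w_audio, w_video = modality_relevance(fused, n_audio=len(audio))
```

In a trained model the queries, keys, and values would pass through learned projection layers, and the relevance weights would come from a learned head rather than a fixed statistic. The sketch only shows the order of operations: cross-attention in both directions first, then self-attention over the combined sequence.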
