Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering
Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma
TL;DR
SHRIKE tackles audio-visual question answering by introducing a structured multi-modal scene graph that explicitly encodes object relations across visual and audio streams, coupled with a Kolmogorov-Arnold Network (KAN)-based mixture of experts (MoE) for fine-grained, question-guided temporal integration. The approach comprises a dedicated Scene Graph Decoder that generates relation triplets from fused audio-visual features and a KAN-based MoE that captures nuanced cross-modal temporal patterns. Training proceeds in two stages to separate scene-graph learning from end-to-end QA. Empirical results on MUSIC-AVQA and MUSIC-AVQA v2 demonstrate state-of-the-art performance and explicit temporal grounding of cues, with ablations confirming the contributions of the multi-modal scene graph (MMSG) and KAN components. The method also shows transferability to related tasks and offers insights into efficient multi-modal grounding and temporal reasoning.
Abstract
In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from complex audio-visual content. Existing methods fail to capture the structural information within videos and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, which captures richer and more nuanced patterns and in turn improves temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
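To make the KAN-based MoE idea concrete, the following is a minimal illustrative sketch and not the SHRIKE implementation. It assumes that each expert is a single KAN-style layer whose learnable univariate functions are linear combinations of fixed Gaussian basis functions (a simplification of the B-splines commonly used in KANs), and that the gate is a dense softmax over all experts. The names `KANLayer`, `KANMoE`, `n_basis`, and `gate_w`, as well as all dimensions, are hypothetical.

```python
import numpy as np


class KANLayer:
    """KAN-style layer: y_j = sum_i phi_ij(x_i), with one learnable
    univariate function phi_ij per (input, output) pair.

    Assumption: phi_ij is a weighted sum of fixed Gaussian bumps,
    used here in place of B-splines to keep the sketch short.
    """

    def __init__(self, d_in, d_out, n_basis=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.centers = np.linspace(-2.0, 2.0, n_basis)
        self.width = self.centers[1] - self.centers[0]
        # Learnable coefficients, shape (d_in, n_basis, d_out).
        self.coef = rng.normal(scale=0.1, size=(d_in, n_basis, d_out))

    def __call__(self, x):
        # x: (T, d_in) -> basis activations: (T, d_in, n_basis)
        basis = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Sum phi_ij(x_i) over inputs i and basis functions k.
        return np.einsum("tik,iko->to", basis, self.coef)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class KANMoE:
    """Dense mixture of KAN experts applied independently per time step."""

    def __init__(self, d_model, n_experts=4, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.experts = [KANLayer(d_model, d_model, rng=rng) for _ in range(n_experts)]
        self.gate_w = rng.normal(scale=0.1, size=(d_model, n_experts))

    def gate(self, x):
        # Per-time-step mixing weights, shape (T, n_experts).
        return softmax(x @ self.gate_w)

    def __call__(self, x):
        g = self.gate(x)
        outs = np.stack([expert(x) for expert in self.experts], axis=1)  # (T, E, d)
        # Gate-weighted sum of the expert outputs.
        return np.einsum("te,ted->td", g, outs)


# Stand-in for T=10 question-aware fused audio-visual features of width 16.
features = np.random.default_rng(1).normal(size=(10, 16))
moe = KANMoE(d_model=16)
fused = moe(features)  # shape (10, 16)
```

In this sketch the gate mixes all experts at every time step. A sparse top-k gate is a common alternative, and the abstract does not specify which variant SHRIKE uses.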
