Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering

Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma

TL;DR

SHRIKE tackles audio-visual question answering by introducing a structured multi-modal scene graph that explicitly encodes object relations across visual and audio streams, coupled with a Kolmogorov-Arnold Network (KAN)-based mixture of experts (MoE) for fine-grained, question-guided temporal integration. The approach comprises a dedicated Scene Graph Decoder that generates relation triplets from fused audio-visual features and a KAN-based MoE that captures nuanced cross-modal temporal patterns; training proceeds in two stages to separate scene-graph learning from end-to-end QA. Empirical results on MUSIC-AVQA and MUSIC-AVQA v2 demonstrate state-of-the-art performance and explicit temporal grounding of cues, with ablations confirming the contributions of the multi-modal scene graph and KAN components. The method also shows transferability to related tasks and offers insights into efficient multi-modal grounding and temporal reasoning.

Abstract

In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from the complex audio-visual content. Existing methods fail to capture the structural information within video and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a multi-modal scene graph that explicitly models objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, which captures richer and more nuanced patterns and in turn improves temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
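
To make the KAN-based MoE idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The class names (`GaussianKANLayer`, `KANMoE`), the Gaussian basis parameterization, the top-k gating rule, and all sizes are assumptions chosen for illustration. Each expert is a single KAN-style layer in which every input-output edge carries its own learnable univariate function (here a weighted sum of Gaussian bumps), and a linear gate routes a fused feature vector to a few experts.

```python
import math
import random


def gaussian_basis(x, centers, width):
    """Evaluate Gaussian radial basis functions at a scalar x."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


class GaussianKANLayer:
    """KAN-style layer: y_j = sum_i phi_ij(x_i), where each phi_ij is a
    weighted sum of Gaussian bumps (hypothetical parameterization)."""

    def __init__(self, in_dim, out_dim, num_basis=5, lo=-2.0, hi=2.0, rng=None):
        rng = rng or random.Random(0)
        step = (hi - lo) / (num_basis - 1)
        self.centers = [lo + k * step for k in range(num_basis)]
        self.width = step
        # coef[j][i][k]: weight of basis k on the edge from input i to output j
        self.coef = [[[rng.gauss(0.0, 0.1) for _ in range(num_basis)]
                      for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, x):
        basis = [gaussian_basis(xi, self.centers, self.width) for xi in x]
        return [sum(c * b
                    for edge, feats in zip(row, basis)
                    for c, b in zip(edge, feats))
                for row in self.coef]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class KANMoE:
    """Mixture of KAN experts with a linear top-k gate (assumed design)."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.experts = [GaussianKANLayer(dim, dim, rng=rng)
                        for _ in range(num_experts)]
        self.gate = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                     for _ in range(num_experts)]
        self.top_k = top_k

    def __call__(self, x):
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.gate]
        top = sorted(range(len(logits)), key=lambda e: logits[e],
                     reverse=True)[:self.top_k]
        weights = softmax([logits[e] for e in top])
        out = [0.0] * len(x)
        for w, e in zip(weights, top):
            out = [o + w * y for o, y in zip(out, self.experts[e](x))]
        return out, dict(zip(top, weights))


# Toy stand-in for one time step of a question-aware fused feature vector.
fused = [0.1 * t for t in range(8)]
output, routing = KANMoE(dim=8)(fused)
```

Here `routing` maps the chosen expert indices to their gate weights, which sum to 1. In a trained model the coefficients and gate would be learned; the random initialization above only demonstrates the data flow.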

Paper Structure

This paper contains 26 sections, 10 equations, 22 figures, 7 tables, 1 algorithm.

Figures (22)

  • Figure 1: Illustration of the AVQA task. Given a video, we construct a multi-modal scene graph that encodes objects and their visual and audio relationships. The question text is then used to select the most relevant relationships, forming a question-conditioned subgraph that is fed into the fusion and reasoning module to output the answer.
  • Figure 2: Overview of the proposed SHRIKE: Features from each modality are obtained by passing the input through a corresponding pretrained encoder. Then we propose (a) Multi-Modal Scene Graph Decoder to extract scene graph features from the video and select specific triplets using the relationship triplets selection. Through (b) Temporal Integration with Gaussian KAN Experts, our model achieves effective question-guided localization of critical temporal segments, enhancing temporal reasoning and multi-modal understanding.
  • Figure 3: Visualized relationship triplet selection results. In the given example, we convert all the relationship triplets into a scene graph and highlight the triplets selected by our method.
  • Figure 3: Ablation study of the proposed framework. M$^2$SG denotes the Multi-Modal Scene Graph module.
  • Figure 4: Visual Question
  • ...and 17 more figures
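
The question-conditioned triplet selection described in the captions of Figures 1 and 3 can be illustrated with a small sketch. This is only one plausible reading, not the paper's procedure: it assumes each relationship triplet and the question have already been embedded as vectors, scores them by cosine similarity, and keeps the top k. The function names, the toy triplets, and the embedding values are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_triplets(triplets, question_vec, k=2):
    """Keep the k triplets whose embeddings best match the question."""
    ranked = sorted(triplets,
                    key=lambda t: cosine(t["embedding"], question_vec),
                    reverse=True)
    return [t["triplet"] for t in ranked[:k]]


# Toy scene graph: (subject, relation, object) with made-up embeddings.
triplets = [
    {"triplet": ("man", "playing", "guitar"), "embedding": [0.9, 0.1, 0.0]},
    {"triplet": ("piano", "left of", "guitar"), "embedding": [0.1, 0.9, 0.1]},
    {"triplet": ("guitar", "sounding", "loudest"), "embedding": [0.7, 0.0, 0.7]},
]
question_vec = [1.0, 0.0, 0.5]  # stand-in for an encoded question

selected = select_triplets(triplets, question_vec, k=2)
# selected == [("guitar", "sounding", "loudest"), ("man", "playing", "guitar")]
```

The selected triplets form the question-conditioned subgraph that would be passed on to the fusion and reasoning stage.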