Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Ziwei Zhou, Rui Wang, Zuxuan Wu

TL;DR

This work introduces Daily-Omni, a benchmark for audio-visual reasoning with temporal alignment, featuring 684 videos and 1197 multiple-choice QA pairs across six tasks. It also presents a scalable QA generation pipeline and a training-free Daily-Omni Agent that combines open-source Visual Language Models, Audio Language Models, and an ASR model to establish a baseline. Experiments reveal that current MLLMs struggle with tasks requiring audio-visual integration, but that simple temporal alignment strategies applied to combined VLMs and ALMs can yield substantial gains. Overall, Daily-Omni supplies a scalable benchmark and a methodological toolkit to drive progress in audio-visual temporal grounding and cross-modal reasoning.

Abstract

Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Questioning and Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1197 multiple-choice QA pairs across 6 major tasks; 2) the Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation, and QA optimization, and which significantly improves the efficiency of human evaluation and the scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent that uses an open-source Visual Language Model (VLM), Audio Language Model (ALM), and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but that combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Code and benchmark are available at https://github.com/Lliar-liar/Daily-Omni.

Paper Structure

This paper contains 17 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Examples of Daily-Omni QAs. The audio and visual information required to answer the questions is provided in the figure. The correct answers to the given questions are highlighted.
  • Figure 2: Distribution of 1197 Daily-Omni QA pairs.
  • Figure 3: The outline of the Daily-Omni QA construction pipeline. The arrows indicate the sequence of the processes.
  • Figure 4: Details of Daily-Omni annotation generation, revision and event alignment. For cost-efficiency, we align all events with one query.
  • Figure 5: The outline of the Daily-Omni Agent.
  • ...and 2 more figures