Towards Open-Vocabulary Audio-Visual Event Localization

Jinxing Zhou, Dan Guo, Ruohao Guo, Yuxin Mao, Jingjing Hu, Yiran Zhong, Xiaojun Chang, Meng Wang

TL;DR

This work defines Open-Vocabulary Audio-Visual Event Localization (OV-AVEL), which requires localizing and categorizing audio-visual events at segment granularity for both seen and unseen classes. It introduces OV-AVEBench, a 24,800-video dataset spanning 67 real-world scenes with segment-level annotations, together with standardized metrics (accuracy, segment-level F1, and event-level F1) for open-vocabulary evaluation. Two baselines are studied: a training-free approach that uses ImageBind's joint multimodal space for zero-shot prediction, and a fine-tuning approach that adds lightweight temporal Transformer layers to learn temporal relations, with an explicit 'other' class to handle unknowns. Results show that the fine-tuning baseline yields substantial gains, especially for unseen classes, and highlight the importance of temporal modeling and careful text-space design (e.g., the 'other' class and sqrt fusion) for robust open-vocabulary AVEL.

Abstract

The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with manual segment-level annotation. We also establish three evaluation metrics for this task. Moreover, we investigate two baseline approaches, one training-free and one using a further fine-tuning paradigm. Specifically, we utilize the unified multimodal space from the pretrained ImageBind model to extract audio, visual, and textual (event classes) features. The training-free baseline then determines predictions by comparing the consistency of audio-text and visual-text feature similarities. The fine-tuning baseline incorporates lightweight temporal layers to encode temporal relations within the audio and visual modalities, using OV-AVEBench training data for model fine-tuning. We evaluate these baselines on the proposed OV-AVEBench dataset and discuss potential directions for future work in this new field.
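
The consistency check used by the training-free baseline can be illustrated with a minimal sketch. The abstract does not spell out the exact decision rule, so the rule below is an assumption: a segment is assigned an event class only when the audio-text and visual-text similarities pick the same class, and is otherwise labeled as background. The function names, the `background` label, and the toy feature vectors are hypothetical; in practice the features would come from the frozen ImageBind encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_class(feature, text_feats):
    """Index of the class whose text feature is most similar to `feature`."""
    sims = [cosine(feature, t) for t in text_feats]
    return max(range(len(sims)), key=sims.__getitem__)


def training_free_predict(audio_feats, visual_feats, text_feats,
                          class_names, background="background"):
    """Per-segment prediction from audio-text / visual-text agreement.

    Assumed rule (not taken from the paper): keep the class only if both
    modalities select it; otherwise treat the segment as background.
    """
    preds = []
    for a, v in zip(audio_feats, visual_feats):
        ia = best_class(a, text_feats)
        iv = best_class(v, text_feats)
        preds.append(class_names[ia] if ia == iv else background)
    return preds
```

With two toy classes whose text features are `[1, 0]` and `[0, 1]`, a segment whose audio and visual features both point toward the first class is labeled with that class, while a segment whose modalities disagree falls back to background. Because the candidate list is just text, unseen classes can be added at inference without retraining.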

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: (a) Illustration of the AVEL task, which aims to temporally localize segments containing events that are both audible and visible, and identify their categories. (b) Studies of AVEL in different settings. In contrast to previous closed-set and open-set settings, we explore a more practical open-vocabulary AVEL problem, which needs to infer explicit event categories for both seen and unseen test data (i.e., data containing classes seen and unseen during training). Each color represents a distinct event class.
  • Figure 2: Statistics of the proposed OV-AVEBench dataset. (a) OV-AVEBench contains 24,800 videos covering 67 practical audio-visual scenes from the real world. Each event category is listed with its video count. Categories highlighted with a black bounding box are available only during the inference phase (unseen classes/data). (b) The audio-visual events in the videos exhibit various temporal scales, and some videos contain only background. We also visualize (c) the category distribution and (d) the video distribution of the seen and unseen data, and (e) the video counts for the training, validation, and test sets.
  • Figure 3: Overview of the proposed baseline methods. We utilize the audio and image encoders of the pretrained ImageBind (Girdhar et al., 2023), with frozen parameters, to extract segment-level audio and visual features. ① The training-free baseline takes the texts of all candidate classes (both seen and unseen) and extracts their text features. The audio-visual event prediction is then decided by evaluating the consistency between audio-text and visual-text feature similarities. ② The fine-tuning baseline additionally inserts temporal layers into the audio and visual encoders to strengthen temporal interaction learning. This model is fine-tuned on the training data, which contains only seen classes. Only the texts of seen classes are known and used during fine-tuning, while the unseen classes are additionally introduced during inference. The final audio-visual event prediction is obtained by fusing the event probabilities of the audio and visual modalities.
  • Figure A4: Detailed performance of the proposed two baselines on each event class.
  • Figure A5: Qualitative examples for seen audio-visual event localization.
  • ...and 1 more figure
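
The probability fusion mentioned in the Figure 3 caption can be sketched as follows. The text only names "sqrt fusion" and an explicit 'other' class, so the formula here is an assumption: per-class audio and visual probabilities are combined by the square root of their product (a geometric mean), and the 'other' entry absorbs segments where the two modalities disagree. The helper names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sqrt_fuse(p_audio, p_visual):
    """Assumed 'sqrt fusion': element-wise sqrt(p_audio * p_visual)."""
    return [math.sqrt(a * v) for a, v in zip(p_audio, p_visual)]


def predict_segment(audio_logits, visual_logits, class_names):
    """Fuse the two modalities and return (predicted class, fused scores)."""
    fused = sqrt_fuse(softmax(audio_logits), softmax(visual_logits))
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best], fused
```

Under this reading, a class scores highly only when both modalities support it. If the audio favors one event and the visual stream favors another, both products shrink and a moderately scored 'other' entry can win, which marks the segment as containing no audio-visual event.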