Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

Eitan Shaar, Ariel Shaulov, Gal Chechik, Lior Wolf

TL;DR

The paper addresses open-world audio-visual event perception (AVVP), focusing on generalization beyond fixed vocabularies and on reducing the annotation burden. It introduces $\text{AV}^2\text{A}$, a training-free, model-agnostic framework that uses score-level fusion to preserve multimodal interactions and a within-video label-shift mechanism with dynamic thresholds to adapt to event distributions that change over the course of a video. The paper also proposes the first training-free, open-vocabulary baseline for the task, and $\text{AV}^2\text{A}$ shows substantial gains when applied to zero-shot and weakly supervised methods, including improvements over existing baselines on the AVE and LLP datasets. The approach achieves state-of-the-art results without any training, although its performance still depends on the capabilities and biases of the underlying foundation models.

Abstract

In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.

Paper Structure

This paper contains 24 sections, 24 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Overview of the AVVP task. Audio-visual event perception focuses on predicting the temporal boundaries of events within a video that are exclusively visible (shown in blue), exclusively audible (shown in red), or both audible and visible (shown in purple).
  • Figure 2: Overview of $\text{AV}^2\text{A}$. The process begins with category selection, where the input video clip passes through a video-level score-level fusion module (blue) to select relevant categories based on a threshold $\tau_f$. These categories guide segment-level score-level fusion, where a dynamic threshold module (orange) updates the thresholds $\boldsymbol{\tau}_t$ via our label-shift technique, using the soft confusion matrix $M$ from prior predictions $\mathcal{Y}_1, \dots, \mathcal{Y}_{t-1}$ and segment scores $P^1_{av}, \dots, P^{t-1}_{av}$, together with the cosine similarity between segments and $P^t_{av}$. Finally, predicted candidates are validated against a confidence threshold $\tau_r$, retaining only those above it. This figure illustrates the process for audio-visual events; audio and visual events are handled similarly.
  • Figure 3: Performance analysis of $\text{AV}^2\text{A}$, based on the LanguageBind model (Zhu et al., 2023), showcasing predictions on audio-visual events, that is, events occurring simultaneously in both audio and video. Comparisons highlight (a) improvements and (b) failure cases relative to state-of-the-art weakly supervised baselines.
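
The three-threshold flow described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The fusion rule (a plain average), the threshold update (lowering a category's threshold in proportion to how often it was detected in earlier segments), and the validation rule (mean fused score of the detected segments) are all assumptions chosen for illustration. In particular, the update stands in for the paper's label-shift technique and does not use the soft confusion matrix $M$ or cosine similarity. The names `fuse`, `parse_video`, `tau_init`, and `alpha` are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]  # category name -> similarity score in [0, 1]


def fuse(audio: Scores, visual: Scores) -> Scores:
    # Assumed fusion rule: plain average of per-category scores.
    return {c: 0.5 * (audio[c] + visual[c]) for c in audio}


def parse_video(
    video_audio: Scores,
    video_visual: Scores,
    seg_audio: List[Scores],
    seg_visual: List[Scores],
    tau_f: float = 0.5,     # category-selection threshold
    tau_init: float = 0.6,  # hypothetical starting segment threshold
    tau_r: float = 0.55,    # final confidence threshold
    alpha: float = 0.3,     # hypothetical adaptation strength
) -> Dict[str, List[int]]:
    # 1) Category selection on fused video-level scores.
    video_scores = fuse(video_audio, video_visual)
    selected = [c for c, s in video_scores.items() if s >= tau_f]

    hits: Dict[str, List[int]] = {c: [] for c in selected}
    hit_scores: Dict[str, List[float]] = {c: [] for c in selected}

    # 2) Segment-level fusion with a per-category dynamic threshold.
    for t, (a, v) in enumerate(zip(seg_audio, seg_visual)):
        fused = fuse(a, v)
        for c in selected:
            # Assumed update: fraction of earlier segments containing c
            # lowers the threshold for c in the current segment.
            freq = len(hits[c]) / t if t > 0 else 0.0
            tau_t = tau_init * (1.0 - alpha * freq)
            if fused[c] >= tau_t:
                hits[c].append(t)
                hit_scores[c].append(fused[c])

    # 3) Validation: keep candidates whose mean fused score exceeds tau_r.
    return {
        c: segs
        for c, segs in hits.items()
        if segs and sum(hit_scores[c]) / len(segs) > tau_r
    }
```

Setting `alpha=0.0` reduces the sketch to a static threshold, which makes it easy to see what the adaptive step changes: a segment whose fused score falls slightly below `tau_init` can still be detected when the same category was detected in the preceding segments.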