UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

Tiantian Geng, Teng Wang, Jinming Duan, Yanfu Zhang, Weili Guan, Feng Zheng, Ling Shao

TL;DR

UniAV introduces a unified framework for joint temporal action localization, sound event detection, and audio-visual event localization in untrimmed videos. It combines unified audio-visual encoding with task-specific experts and a language-aware classifier to share knowledge across tasks while preserving task-specific insights, enabling open-set and cross-task localization via prompts. Across ActivityNet 1.3, DESED, and UnAV-100, UniAV achieves state-of-the-art or competitive results, outperforming single-task models and naive multi-task baselines, and also serves as effective pre-training for downstream single-task models. The approach highlights the value of multimodal sharing, prompt-driven category embeddings, and multi-scale temporal modeling for holistic video understanding with practical open-world capabilities.

Abstract

Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework that solves the TAL, SED and AVEL tasks together to facilitate holistic video understanding. This is challenging, however, because different tasks emphasize distinct event characteristics and existing task-specific datasets differ substantially in size, domain and duration. As a result, a naive multi-task strategy yields unsatisfactory results. To tackle the problem, we introduce UniAV, a Unified Audio-Visual perception network that effectively learns and shares mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the knowledge unique to each task. In addition, instead of using separate prediction heads, we develop a novel unified language-aware classifier that uses semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performance compared to the state-of-the-art task-specific methods on the ActivityNet 1.3, DESED and UnAV-100 benchmarks.
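To make the idea of a language-aware classifier more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each category is represented by an embedding vector (in the paper these come from encoding prompted category names with a text encoder) and that each time step of the video has an audio-visual feature vector of the same dimension. The function names `cosine` and `classify`, the `temperature` value, the sigmoid scoring, and all vectors below are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(features, class_embeddings, temperature=0.1):
    """Score every time step against every category embedding.

    features: list of T feature vectors, one per time step.
    class_embeddings: dict mapping category name -> embedding vector.
    Returns a list of T dicts holding one sigmoid score per category.
    """
    scores = []
    for feature in features:
        row = {}
        for name, embedding in class_embeddings.items():
            logit = cosine(feature, embedding) / temperature
            row[name] = 1.0 / (1.0 + math.exp(-logit))
        scores.append(row)
    return scores


# Hypothetical stand-ins for text-encoder outputs of prompted category names.
class_embeddings = {
    "dog barking": [1.0, 0.0, 0.0],
    "playing guitar": [0.0, 1.0, 0.0],
}

# Hypothetical audio-visual features for three time steps.
features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

scores = classify(features, class_embeddings)
best = [max(row, key=row.get) for row in scores]

# Open-set use: a novel category only needs a new embedding, not a new head.
extended = dict(class_embeddings)
extended["rain"] = [0.0, 0.0, 1.0]
best_extended = [max(row, key=row.get) for row in classify(features, extended)]
```

Because categories enter the model only as embeddings, one scoring function can serve all three tasks, and adding a category at inference time amounts to adding one more vector to the dictionary.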

Paper Structure

This paper contains 17 sections, 5 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: We propose a unified audio-visual perception framework to localize all three kinds of instances in untrimmed videos, including visual actions, sound events and audio-visual events. All these instances equally contribute to the comprehensive understanding of video content.
  • Figure 2: The overview of our unified framework. Given a visual and audio pair from an untrimmed video, we first tokenize them by a pair of pre-trained visual and audio encoders to obtain generic audio and visual representations. Then, the encoded features are fed into an audio-visual pyramid transformer, which consists of $L_{1}$ uni-modal and $L_{2}$ cross-modal transformer blocks for temporal relation modeling and audio-visual fusion at multiple temporal scales. The task-specific experts inserted in transformer blocks learn distinct knowledge for each task, which is illustrated on the right side. Besides, the categories of each task are encoded with task prompts to compute similarities with the obtained audio-visual feature pyramid, which is used to perform language-aware classification. Finally, the model recognizes classes and regresses temporal boundaries for all types of instances occurring in the video.
  • Figure 3: Visualization of dataset statistics. The circle area is proportional to the number of videos in each dataset. There are substantial gaps in dataset scales and video/instance duration between these task-specific datasets.
  • Figure 4: Qualitative results on TAL, AVEL and SED tasks. The examples are from the validation set of ActivityNet 1.3, the test set of UnAV-100, and the public evaluation set of DESED, respectively. "GT" is short for ground truth. "SOTA" denotes the state-of-the-art methods (TriDet [shi2023tridet] for TAL, UnAV [geng2023dense] for AVEL, and the audio-only single-task model for SED, where all models use the same features). "Ours" is our AT model. We show the boundaries with the highest overlap with the ground truth.
  • Figure 5: Localization results of our AT model on all three tasks with different text encoders for category embedding, measured by mAP@tIoU=0.5.
  • ...and 5 more figures
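
The Figure 5 caption reports mAP@tIoU=0.5. As a reminder of what the threshold means, here is a minimal sketch of the temporal IoU test between one predicted segment and one ground-truth segment. It covers only the overlap criterion; a full mAP computation also requires matching class labels, ranking predictions by confidence, and averaging precision over classes. The function names and the example segments are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def meets_threshold(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth by at least `threshold`."""
    return temporal_iou(pred, gt) >= threshold


# Segments share 4 s (from 4 to 8) out of a 8 s union (from 2 to 10).
overlapping = temporal_iou((2.0, 8.0), (4.0, 10.0))

# Segments that do not overlap at all.
disjoint = temporal_iou((0.0, 2.0), (4.0, 10.0))
```

In this example the first pair sits exactly on the 0.5 threshold and would count as a match, while the second pair has zero overlap and would not.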