UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
Tiantian Geng, Teng Wang, Jinming Duan, Yanfu Zhang, Weili Guan, Feng Zheng, Ling Shao
TL;DR
UniAV introduces a unified framework for joint temporal action localization, sound event detection, and audio-visual event localization in untrimmed videos. It combines unified audio-visual encoding with task-specific experts and a language-aware classifier to share knowledge across tasks while preserving task-specific insights, enabling open-set and cross-task localization via prompts. Across ActivityNet 1.3, DESED, and UnAV-100, UniAV achieves state-of-the-art or competitive results, outperforming single-task models and naive multi-task baselines, and also serves as effective pre-training for downstream single-task models. The approach highlights the value of multimodal sharing, prompt-driven category embeddings, and multi-scale temporal modeling for holistic video understanding with practical open-world capabilities.
Abstract
Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework that solves the TAL, SED and AVEL tasks together to facilitate holistic video understanding. This is challenging, however, because different tasks emphasize distinct event characteristics and existing task-specific datasets differ substantially in size, domain and duration. As a result, a naive multi-task strategy yields unsatisfactory results. To tackle this problem, we introduce UniAV, a Unified Audio-Visual perception network that effectively learns and shares mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the knowledge unique to each task. In addition, instead of using separate prediction heads, we develop a novel unified language-aware classifier that uses semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks, with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performance compared to state-of-the-art task-specific methods on the ActivityNet 1.3, DESED and UnAV-100 benchmarks.
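To illustrate the general idea behind a language-aware classifier, the sketch below scores each temporal segment feature against text embeddings of category prompts using cosine similarity, so a new category can be added by supplying one more prompt embedding rather than retraining a prediction head. This is not the authors' implementation: the function names, the temperature value, and the three-dimensional toy embeddings (standing in for the output of a real text encoder) are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(segment_features, prompt_embeddings, temperature=0.07):
    """Score every temporal segment against every category prompt.

    segment_features: list of feature vectors, one per temporal segment.
    prompt_embeddings: dict mapping a category prompt to its text embedding.
    temperature: assumed scaling factor, not a value taken from the paper.
    """
    prompts = {name: l2_normalize(emb) for name, emb in prompt_embeddings.items()}
    logits = []
    for feat in segment_features:
        f = l2_normalize(feat)
        logits.append({
            name: sum(a * b for a, b in zip(f, emb)) / temperature
            for name, emb in prompts.items()
        })
    return logits


def predict(segment_features, prompt_embeddings):
    """Return the best-matching category prompt for each segment."""
    rows = cosine_logits(segment_features, prompt_embeddings)
    return [max(row, key=row.get) for row in rows]


# Hypothetical toy embeddings; a real system would obtain these from a text encoder.
prompts = {
    "dog barking": [1.0, 0.0, 0.0],
    "playing guitar": [0.0, 1.0, 0.0],
}
segments = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

# Open-set use: add a new category by adding a prompt embedding only.
prompts_with_new = dict(prompts, applause=[0.0, 0.0, 1.0])
labels = predict(segments, prompts_with_new)
```

Because the class set lives entirely in the prompt dictionary, the same scoring function can serve categories from different tasks, which is the property the abstract attributes to the unified classifier.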
