AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Yuan Tseng; Layne Berry; Yi-Ting Chen; I-Hsiang Chiu; Hsuan-Hao Lin; Max Liu; Puyuan Peng; Yi-Jen Shih; Hung-Yu Wang; Haibin Wu; Po-Yao Huang; Chun-Mao Lai; Shang-Wen Li; David Harwath; Yu Tsao; Shinji Watanabe; Abdelrahman Mohamed; Chi-Luen Feng; Hung-yi Lee

TL;DR

AV-SUPERB is a benchmark for general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. Results show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.

Abstract

Audio-visual representation learning aims to develop systems with human-like perception by exploiting the correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of the learned representations are unclear. To this end, we propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalizes to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.

Paper Structure (12 sections, 1 figure, 2 tables)

Introduction
Related Work
Benchmark Details
Downstream Task Selection
Pretrained Upstream Models
Experimental Results and Discussion
Downstream Datasets and Training Details
Overall Results
When does Visual Grounding Improve Audio Representation Learning?
Layer-wise Contribution Analysis
How does intermediate-task fine-tuning affect performance?
Conclusions

Figures (1)

Figure 1: We consider three evaluation scenarios: extracting features from audio-only, video-only, or audio-visual inputs. Following SUPERB, the weighted sum of features from Transformer layers (if applicable) is used as input for fine-tuning a small downstream model for each individual task. Details of the selected tasks are given in the Downstream Task Selection section.
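
The weighted sum of layer features mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that one scalar logit is kept per Transformer layer and that the logits are softmax-normalized before mixing; the function names, the nested-list feature layout, and the fixed (rather than learned) logits are illustrative assumptions, not the benchmark's actual code.

```python
import math
from typing import List, Sequence

Frames = List[List[float]]  # [num_frames][feature_dim]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features: Sequence[Frames],
                       layer_logits: Sequence[float]) -> Frames:
    """Mix per-layer features into one feature sequence.

    layer_features[l][t][d] is feature d of frame t from layer l.
    layer_logits holds one scalar per layer (hypothetical; in practice
    these would be trained jointly with the downstream model).
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return fused


# Toy input: 2 layers, 1 frame, 2 feature dimensions.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused_features = weighted_layer_sum(toy_layers, [0.0, 0.0])
```

With equal logits every layer contributes equally, so the fused features reduce to the per-layer average. Unequal logits shift the mix toward particular layers, which is the kind of information a layer-wise contribution analysis can inspect.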
