Siamese Vision Transformers are Scalable Audio-visual Learners

Yan-Bo Lin; Gedas Bertasius

TL;DR

This work investigates an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. AVSiam uses a single shared vision transformer backbone to process both audio and visual inputs, which improves parameter efficiency, reduces the GPU memory footprint, and allows the method to scale to larger datasets and model sizes.

Abstract

Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and not scalable. In this work, we investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes. We pretrain our model using a contrastive audio-visual matching objective with a multi-ratio random masking scheme, which enables our model to process larger audio-visual instance batches, helpful for contrastive learning. Unlike prior audio-visual methods, our method can robustly handle audio, visual, and audio-visual inputs with a single shared ViT backbone. Furthermore, despite using the shared backbone for both modalities, AVSiam achieves competitive or even better results than prior methods on AudioSet and VGGSound for audio-visual classification and retrieval. Our code is available at https://github.com/GenjiB/AVSiam
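As a rough illustration of the contrastive audio-visual matching objective mentioned in the abstract, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired audio and visual embeddings. The exact loss form, the temperature value of 0.05, and all function names are assumptions made for illustration and are not taken from the paper or its code release.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_av_loss(audio, visual, temperature=0.05):
    # audio[i] and visual[i] are assumed to come from the same clip.
    # temperature=0.05 is an illustrative value, not the paper's setting.
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    # Cosine similarity between every audio and every visual embedding.
    logits = [[sum(p * q for p, q in zip(ai, vj)) / temperature for vj in v]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Hypothetical toy batch: two clips with 2-dimensional embeddings.
matched = contrastive_av_loss([[1.0, 0.0], [0.0, 1.0]],
                              [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_av_loss([[1.0, 0.0], [0.0, 1.0]],
                               [[0.0, 1.0], [1.0, 0.0]])
```

Because every other pair in the batch acts as a negative, a loss of this shape benefits from larger batches, which is consistent with the abstract's remark that masking makes larger audio-visual instance batches possible.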

Paper Structure (17 sections, 3 equations, 4 figures, 13 tables)

Introduction
Related Work
Audio-Visual Representation Learning
Unified Multimodal Representation Learning
Technical Approach
The AVSiam Model
Training the AVSiam Model
Implementation Details
Experimental Setup
Results and Analysis
Audio-Visual Classification Results
Audio-visual Retrieval Results
Throughput Comparison
Ablation Studies
Qualitative Results
...and 2 more sections

Figures (4)

Figure 1: Our Audio-visual Siamese network (AVSiam) uses a single shared backbone to process audio and visual data, which reduces its GPU memory footprint and allows us to scale our method to larger datasets and model sizes. Compared to prior audio-visual approaches (CAV-MAE, MAViL), which are very costly, our model is more efficient and also achieves higher accuracy on standard audio-visual classification benchmarks.
Figure 2: Our Pretraining Framework. Our AVSiam approach uses a single shared vision transformer backbone to process both audio and visual data. To train our model, we use a novel multi-ratio masking scheme, which randomly masks audio and visual tokens at various masking ratios. As our pretraining objectives, we employ audio-visual contrastive matching and audio-visual token reconstruction loss functions.
Figure 3: Multi-Ratio Masking. We apply random masking to audio and visual tokens in various proportions during each training iteration.
Figure 4: t-SNE Audio and Image Embedding Visualization. We use t-SNE to visualize the audio and visual features extracted by (1) a baseline that uses separate audio and visual encoders (i.e., CAV-MAE) and (2) our shared-weight encoder method (i.e., AVSiam-Base) on the VGGSound dataset. Each point in the plot represents a single input ($+$ for audio and for visual), while different colors depict distinct audio-visual categories. Based on this illustration, we observe that AVSiam learns more semantically separable features than CAV-MAE. Furthermore, unlike CAV-MAE, AVSiam groups audio and visual features corresponding to the same audio-visual category into the same clusters. This suggests that compared to the methods that use separate audio and visual encoders, our AVSiam with a shared-weight audio-visual encoder learns to encode audio and visual features into a more similar latent space.
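The multi-ratio masking idea from the Figure 3 caption can be sketched as follows: at each training iteration one masking ratio is drawn at random, and that fraction of tokens is dropped before the tokens reach the shared backbone. The candidate ratios, the token count of 196, and the function name are hypothetical choices for illustration; the captions do not state the values the authors use.

```python
import random

def multi_ratio_mask(num_tokens, ratios=(0.0, 0.25, 0.5, 0.75), rng=None):
    # Draw one masking ratio for this iteration, then keep a random subset
    # of token indices. The ratio set is an assumption, not from the paper.
    rng = rng or random.Random()
    ratio = rng.choice(ratios)
    num_keep = max(1, int(num_tokens * (1.0 - ratio)))  # always keep >= 1 token
    kept = sorted(rng.sample(range(num_tokens), num_keep))
    return ratio, kept

# Hypothetical usage: 196 patch tokens, one draw per training iteration.
rng = random.Random(0)
for step in range(3):
    ratio, kept = multi_ratio_mask(196, rng=rng)
    print(f"step {step}: ratio={ratio}, tokens kept={len(kept)}")
```

Higher ratios leave fewer tokens per sample, so more samples fit in a batch for the same memory budget, while lower ratios expose the model to nearly complete inputs.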
