Siamese Vision Transformers are Scalable Audio-visual Learners

Yan-Bo Lin, Gedas Bertasius

TL;DR

This work investigates using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining, which uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing the method to scale to larger datasets and model sizes.

Abstract

Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and not scalable. In this work, we investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes. We pretrain our model using a contrastive audio-visual matching objective with a multi-ratio random masking scheme, which enables our model to process larger audio-visual instance batches, helpful for contrastive learning. Unlike prior audio-visual methods, our method can robustly handle audio, visual, and audio-visual inputs with a single shared ViT backbone. Furthermore, despite using the shared backbone for both modalities, AVSiam achieves competitive or even better results than prior methods on AudioSet and VGGSound for audio-visual classification and retrieval. Our code is available at https://github.com/GenjiB/AVSiam
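The contrastive audio-visual matching objective mentioned above can be illustrated with a small sketch. The abstract does not give the exact loss, so this assumes a standard symmetric InfoNCE-style formulation over one embedding per clip and modality. The function names, the `temperature` value, and the use of plain Python lists are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_av_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_emb[i] and visual_emb[i] are assumed to come from the same clip;
    every other pairing in the batch is treated as a negative.
    The temperature of 0.07 is a common default, not a value from the paper.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    visual = [l2_normalize(v) for v in visual_emb]
    # Cosine similarity between every audio and every visual embedding.
    sim = [
        [sum(p * q for p, q in zip(a, v)) / temperature for v in visual]
        for a in audio
    ]
    sim_t = [list(col) for col in zip(*sim)]  # visual-to-audio direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

Because every other item in the batch serves as a negative, a larger batch gives each pair more negatives to be contrasted against, which is why the abstract describes larger audio-visual batches as helpful for contrastive learning.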

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: Our Audio-visual Siamese network (AVSiam) uses a single shared backbone to process audio and visual data, which reduces its GPU memory footprint and allows us to scale our method to larger datasets and model sizes. Compared to prior audio-visual approaches [cavmae, nips23_mavil, avmae], which are very costly, our model is more efficient and achieves higher accuracy on standard audio-visual classification benchmarks.
  • Figure 2: Our Pretraining Framework. Our AVSiam approach uses a single shared vision transformer backbone to process both audio and visual data. To train our model, we use a novel multi-ratio masking scheme, which randomly masks audio and visual tokens at various masking ratios. As our pretraining objectives, we employ audio-visual contrastive matching and audio-visual token reconstruction loss functions.
  • Figure 3: Multi-Ratio Masking. We apply random masking to audio and visual tokens in various proportions during each training iteration.
  • Figure 4: t-SNE Audio and Image Embedding Visualization. We use t-SNE to visualize the audio and visual features extracted by (1) a baseline that uses separate audio and visual encoders (i.e., CAV-MAE) and (2) our shared-weight encoder method (i.e., AVSiam-Base) on the VGGSound dataset. Each point in the plot represents a single input ($+$ for audio and for visual), while different colors depict distinct audio-visual categories. Based on this illustration, we observe that AVSiam learns more semantically separable features than CAV-MAE. Furthermore, unlike CAV-MAE, AVSiam groups audio and visual features corresponding to the same audio-visual category into the same clusters. This suggests that, compared to methods that use separate audio and visual encoders, AVSiam with its shared-weight audio-visual encoder learns to encode audio and visual features into a more closely aligned latent space.
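
The multi-ratio masking described in the Figure 2 and Figure 3 captions can be sketched as follows. The captions only state that audio and visual tokens are randomly masked at various ratios in each training iteration, so the candidate ratios, the independent sampling of a ratio per modality, and all function names here are hypothetical choices made for illustration.

```python
import random


def random_mask(tokens, ratio, rng):
    """Drop roughly `ratio` of the tokens uniformly at random.

    Returns the kept tokens and their original indices (in order).
    At least one token is always kept.
    """
    num_tokens = len(tokens)
    num_keep = max(1, int(num_tokens * (1.0 - ratio)))
    kept_indices = sorted(rng.sample(range(num_tokens), num_keep))
    return [tokens[i] for i in kept_indices], kept_indices


def multi_ratio_mask(audio_tokens, visual_tokens,
                     ratios=(0.0, 0.25, 0.5, 0.75), seed=None):
    """Mask one training iteration's audio and visual tokens.

    Assumption: a masking ratio is drawn independently for each modality
    from `ratios`; the values in `ratios` are placeholders, not the
    paper's settings.
    """
    rng = random.Random(seed)
    audio_ratio = rng.choice(ratios)
    visual_ratio = rng.choice(ratios)
    audio_kept, audio_idx = random_mask(audio_tokens, audio_ratio, rng)
    visual_kept, visual_idx = random_mask(visual_tokens, visual_ratio, rng)
    return {
        "audio": {"tokens": audio_kept, "indices": audio_idx, "ratio": audio_ratio},
        "visual": {"tokens": visual_kept, "indices": visual_idx, "ratio": visual_ratio},
    }
```

Only the kept tokens would be passed to the shared backbone, so a higher masking ratio means fewer tokens per instance and room for more instances per batch at the same memory budget.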