Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup

Seokun Kang, Taehwan Kim

TL;DR

The paper tackles semi-supervised video action recognition by leveraging both visual and audio information. It introduces a transformer-based audio-visual SSL framework that uses an audio source localization-guided mixup to preserve inter-modal relationships, complemented by visual-audio contrastive learning. The method employs a total loss $L_{total}=L_s+\gamma_1L_u+\gamma_2L_{mix}+\gamma_3L_c$ and a flexible pseudo-label threshold $\tau$, achieving state-of-the-art results on UCF-51, Kinetics-400, and VGGSound with very limited labeled data. Ablation studies confirm the benefits of the ASL-guided masking and cross-modal contrastive objectives, highlighting the importance of inter-modal coherence in SSL settings. The work demonstrates that modeling audio-visual interactions in SSL can substantially boost performance, with practical implications for efficient multi-modal video understanding.
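
As a rough illustration of how the four loss terms and the threshold $\tau$ fit together, the sketch below combines already-computed scalar losses and filters pseudo-labels by confidence. The function names, the default weights, and the FixMatch-style rule "keep a pseudo-label when the top class probability is at least $\tau$" are assumptions made for illustration; the summary does not specify how the paper implements its flexible threshold.

```python
def pseudo_labels(probs, tau):
    """Return (predicted_class, keep) for each unlabeled sample.

    Assumption: a sample is kept when its top class probability is >= tau.
    """
    results = []
    for p in probs:
        confidence = max(p)
        results.append((p.index(confidence), confidence >= tau))
    return results


def total_loss(l_s, l_u, l_mix, l_c, gamma1=1.0, gamma2=1.0, gamma3=1.0):
    """L_total = L_s + gamma1 * L_u + gamma2 * L_mix + gamma3 * L_c.

    The default gamma values are placeholders, not the paper's settings.
    """
    return l_s + gamma1 * l_u + gamma2 * l_mix + gamma3 * l_c


# Two unlabeled samples with two classes each; only the first passes tau = 0.8.
labels = pseudo_labels([[0.1, 0.9], [0.6, 0.4]], tau=0.8)
loss = total_loss(1.0, 2.0, 3.0, 4.0, gamma1=0.5, gamma2=0.5, gamma3=0.25)
```

In a real training loop the unsupervised term $L_u$ would be computed only over the samples whose pseudo-label is kept.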

Abstract

Video action recognition is a challenging but important task for understanding what happens in a video. However, annotating videos is costly, so semi-supervised learning (SSL) has been studied as a way to improve performance with only a small amount of labeled data. Prior work on semi-supervised video action recognition has mostly focused on a single modality, visuals, even though video is multi-modal; using both visuals and audio should improve performance further, but this has not been well explored. We therefore propose audio-visual SSL for video action recognition, which uses visual and audio information together even in the challenging setting where only a few labeled samples are available. In addition, to make full use of the audio and video information, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between the video and audio modalities. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed semi-supervised audio-visual action recognition framework and audio source localization-guided mixup achieve superior performance.

Paper Structure

This paper contains 23 sections, 14 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of audio source localization-guided mixup framework. The framework performs audio source localization-guided mixup on video clips A and B. A localization map from video A is used to create a mask that highlights semantically important regions, considering the interrelation between video and audio. Log mel-filterbank coefficients of audios A and B are then interpolated, and the mixed video and audio are used for prediction.
  • Figure 2: Overview of our audio source localization-guided mixup framework. The framework performs the audio source localization-guided mixup on video clips A and B. For generating the audio source localization map, video clip A is utilized. The video and audio from clip A are processed through an audio source localization model to produce the localization maps. This generated map is then used as the weight for performing multinomial sampling without replacement, creating an audio source localization-guided mask. Considering the audio information, this mask guides semantically important regions in video A. Consequently, our proposed audio source localization-guided mixup allows consideration of the interrelation between video and audio modalities sharing the same video clip. For audio A and B, log mel-filterbank coefficients are transformed and interpolated at the pixel level. The resulting mixed video and audio are then used as input for prediction.
  • Figure 3: Visualization of the TubeToken mask and our proposed Audio Source Localization-guided Mask. When an original image (a) is given, the TubeToken masking creates a random pattern mask (b), resulting in a masked image (c) for SVFormer input. Sampling this mask 255 times yields an average TubeToken Mask (d). Our method uses an audio source localization-guided mask. Starting with the original image (a), a localization map (e) is generated and used as a weight for sampling, creating the guided mask (f). Applying this mask to the original image results in (g). Sampling this mask 255 times produces (h), highlighting audio source areas and allowing for visual-audio modality mixup.
  • Figure 4: Impact of threshold ($\tau$). This figure presents the results of experiments conducted to explore the performance impact of the hyper-parameter $\tau$. It specifically focuses on the changes in performance with varying thresholds of $\tau$ during training with only one labeled sample in the UCF-51 dataset.
  • Figure 5: Ablation study on the effect of varying frame counts (1, 2, 4, and 8 frames) on the audio source localization map. This study evaluates the impact of averaging the localization maps over different numbers of frames before using them in the sampling process.
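
The captions for Figures 2 and 5 describe three steps: averaging localization maps over frames, using the result as weights for multinomial sampling without replacement to build a mask, and mixing video patches and audio features from clips A and B. The sketch below is a minimal standard-library illustration of those steps, not the authors' implementation. All function names are hypothetical, patches and log-mel features are flattened to plain lists, and two conventions are assumptions: a mask value of 1 keeps the patch from clip A, and the audio interpolation weight `lam` equals the fraction of patches kept from A.

```python
import random


def average_maps(frame_maps):
    """Average per-frame localization maps, each a flat list of patch scores."""
    count = len(frame_maps)
    return [sum(scores) / count for scores in zip(*frame_maps)]


def sample_guided_mask(loc_map, num_keep, rng):
    """Draw num_keep distinct patch indices, weighted by loc_map, as a 0/1 mask.

    Requires at least num_keep patches with a positive weight.
    """
    candidates = list(range(len(loc_map)))
    weights = list(loc_map)
    mask = [0] * len(loc_map)
    for _ in range(num_keep):
        # One weighted draw, then remove the winner so it cannot repeat.
        pos = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        mask[candidates.pop(pos)] = 1
        weights.pop(pos)
    return mask


def mix_video(patches_a, patches_b, mask):
    """Assumption: mask == 1 keeps the patch from clip A, otherwise clip B."""
    return [a if m else b for a, b, m in zip(patches_a, patches_b, mask)]


def mix_audio(mel_a, mel_b, lam):
    """Element-wise interpolation of two log-mel feature lists."""
    return [lam * a + (1 - lam) * b for a, b in zip(mel_a, mel_b)]


rng = random.Random(0)
# Two frames of a 4-patch localization map; patches 2 and 3 carry all the weight.
loc_map = average_maps([[0.0, 0.0, 4.0, 2.0], [0.0, 0.0, 6.0, 4.0]])
mask = sample_guided_mask(loc_map, num_keep=2, rng=rng)
mixed_video = mix_video(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"], mask)
lam = sum(mask) / len(mask)  # assumed link between mask ratio and audio weight
mixed_audio = mix_audio([1.0, 2.0], [3.0, 4.0], lam)
```

Because only two patches have non-zero weight in this toy map, the sampled mask always selects exactly those two. With a dense map, repeating the sampling many times and averaging the masks would give a picture analogous to panel (h) of Figure 3.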