Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling

Jinxing Zhou, Dan Guo, Yiran Zhong, Meng Wang

TL;DR

This work addresses weakly-supervised audio-visual video parsing by introducing VAPLAN, which has three components. Pseudo label generation (PLG) uses frozen CLIP and CLAP models to produce segment-level pseudo labels for the visual and audio streams. Pseudo label exploitation (PLE) uses these labels through a richness-aware loss that aligns the category-richness and segment-richness of the model predictions with those of the pseudo labels. Pseudo label denoising (PLD) refines the visual pseudo labels with a segment-wise denoising strategy. The approach achieves state-of-the-art performance on the LLP dataset for audio, visual, and audio-visual event parsing and generalizes to related audio-visual tasks such as audio-visual event localization (AVEL), suggesting potential for fine-grained, open-vocabulary supervision in multimodal video understanding.

Abstract

The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It is often performed in a weakly-supervised manner, where only video-level event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by assigning the known video event labels to each modality. However, these labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the large-scale pretrained models CLIP and CLAP to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function that exploits these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. Extensive experiments on the LLP dataset demonstrate the effectiveness of each proposed design, and we achieve state-of-the-art video parsing performance on all types of event parsing, i.e., audio event, visual event, and audio-visual event. We also examine the proposed pseudo label generation strategy on a related weakly-supervised audio-visual event localization task, and the experimental results again verify the benefits and generalization of our method.
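The segment-level pseudo label generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that segment and event-text embeddings have already been extracted by a frozen encoder, and the cosine similarity, the function names, and the threshold value are all illustrative assumptions. The only ideas taken from the text are that each segment is compared against event texts and that labels absent from the video-level label are never assigned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_pseudo_labels(segment_embs, text_embs, video_labels, threshold):
    """Return a T x C matrix of 0/1 pseudo labels (hypothetical helper).

    segment_embs: T embeddings, one per video segment (assumed precomputed)
    text_embs:    C embeddings, one per event-category text (assumed precomputed)
    video_labels: C flags (0/1), the weak video-level label
    threshold:    similarity cut-off; its value here is an assumption
    """
    labels = []
    for seg in segment_embs:
        row = []
        for c, txt in enumerate(text_embs):
            similar = cosine(seg, txt) > threshold
            # A category can only be assigned if the video-level label contains it.
            row.append(1 if video_labels[c] == 1 and similar else 0)
        labels.append(row)
    return labels


# Toy 2-D embeddings: three segments, three event categories.
segments = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weak_label = [1, 1, 0]
pseudo = segment_pseudo_labels(segments, texts, weak_label, threshold=0.5)
```

In this toy case the third segment is similar to both of the first two event texts, so it receives both labels, while the third category is never assigned because the video-level label excludes it.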

Paper Structure (15 sections, 11 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: An illustration of the weakly-supervised audio-visual video parsing (AVVP) task and our pseudo label exploration method. (a) Given a video and its event labels ("speech" and "vacuum cleaner"), (b) the AVVP task requires predicting and localizing the audio events, visual events, and audio-visual events. Note that "vacuum cleaner" only exists in the visual track, while "speech" exists in both the audio and visual tracks, resulting in the audio-visual event "speech". (c) To ease this challenging weakly-supervised task, we aim to explicitly assign reliable segment-level audio and visual pseudo labels. In our pseudo label generation process, the pretrained CLAP and CLIP models are used to tell what events occur in each audio and visual segment, respectively. (d) We further propose a pseudo label denoising strategy to improve the obtained visual pseudo labels by examining those segments that have abnormally large forward loss values. In the example, the visual event "vacuum cleaner" at the third segment is assigned an incorrect pseudo label '0' and incurs a large forward loss. Our pseudo label denoising strategy corrects this, giving the accurate pseudo label '1'.
  • Figure 2: Overview of our method. As a label refining method, we aim to produce high-quality and fine-grained segment-wise event labels. For the backbone, any existing network for the AVVP task can be used to generate event predictions. Here, we adopt the baseline HAN [tian2020HAN]. In our solution, we design a pseudo label generation (PLG) module, where the pretrained CLIP [radford2021CLIP] and CLAP [wu2023clap] are used to generate segment-level pseudo labels for the visual and the audio modality, respectively. Notably, the parameters of CLIP and CLAP are frozen. In the figure, we detail the visual pseudo label generation and simplify that of the audio modality since they share similar pipelines. In brief, the pseudo labels are identified by thresholding the similarity between the visual/audio embeddings and the event text embeddings. For the $t$-th segment, the video label 'speech' is filtered out for the visual modality and only 'rooster' remains for the audio modality. After that, with the generated pseudo labels, we propose pseudo label exploitation (PLE), which designs a richness-aware loss as a new fully supervised objective to help the model align the category richness and segment richness of the prediction with those of the pseudo label. Lastly, we design a pseudo label denoising (PLD) strategy that further refines the pseudo labels by reversing the positions with anomalously large forward loss values. Specifically, we re-examine the pseudo labels along the timeline. Pseudo labels of those segments with abnormally high binary cross-entropy forward loss are refined (the motivation and implementation details are given in Sec. \ref{sec:method_PLD}). The updated pseudo labels are further used as new supervision for model training. $\otimes$ denotes matrix multiplication and $\odot$ denotes element-wise multiplication.
  • Figure 3: Event-level F-scores of pseudo labels for each event category. (a) We display the event-level F-scores of audio and visual pseudo labels generated by PLG. (b) Compared to PLG, PLD further improves the event-level F-scores for most categories, providing more accurate visual pseudo labels. All the results are reported on the validation set of the LLP dataset.
  • Figure 4: Qualitative examples for the weakly-supervised audio-visual event localization task. This task aims to temporally locate those segments containing events that are both audible and visible. The previous state-of-the-art method, CMBS [xia2022cross], utilizes only the video-level weak labels for model training and predictions. In contrast, our method can generate high-quality segment-level pseudo labels, offering fine-grained supervision during training and producing more accurate localization results. "GT" denotes the ground truth. "PL-A" and "PL-V" represent our segment-level pseudo labels for the audio and visual modalities, respectively. The audio-visual event pseudo labels ("PL-AV") result from the intersection of "PL-A" and "PL-V". Our method surpasses the vanilla CMBS model in distinguishing between the background and audio-visual events (a) as well as among different audio-visual event categories (b).
  • Figure 5: Qualitative examples of audio-visual video parsing using different methods. We compare our method with HAN [tian2020HAN], MA [wu2021MA], and JoMoLD [cheng2022JoMOLD]. "GT" denotes the ground truth. Our method successfully recognizes that there is only one visual event, "violin" in (a) and "basketball bounce" in (b). Our method is also more accurate in parsing the audio events and audio-visual events, providing better temporal boundaries of the events.
  • ...and 2 more figures
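
Two steps mentioned in the captions above can be sketched in a few lines: flipping pseudo labels at segments with abnormally large binary cross-entropy loss (Figures 1 and 2), and intersecting the audio and visual pseudo labels to obtain audio-visual ones (Figure 4). This is a minimal illustration under stated assumptions, not the authors' implementation. In particular, the rule that a loss is "abnormal" when it exceeds `ratio` times the mean loss along the timeline, the value of `ratio`, and all function names are hypothetical.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def denoise_labels(probs, pseudo, ratio=2.0):
    """Flip per-segment pseudo labels of one event category (hypothetical helper).

    probs:  model probabilities for the category, one per segment
    pseudo: current 0/1 pseudo labels, one per segment
    ratio:  assumed abnormality criterion, loss > ratio * mean loss
    """
    losses = [bce(p, y) for p, y in zip(probs, pseudo)]
    mean_loss = sum(losses) / len(losses)
    return [1 - y if loss > ratio * mean_loss else y
            for loss, y in zip(losses, pseudo)]


def audio_visual_labels(pl_a, pl_v):
    """Element-wise intersection of T x C audio and visual pseudo labels."""
    return [[a & v for a, v in zip(row_a, row_v)]
            for row_a, row_v in zip(pl_a, pl_v)]


# The model is confident the event is present in the third segment,
# but the pseudo label there is 0, so that segment has a large loss.
refined = denoise_labels(probs=[0.9, 0.9, 0.95, 0.1], pseudo=[1, 1, 0, 0])

pl_av = audio_visual_labels([[1, 0], [1, 1]], [[1, 1], [0, 1]])
```

The toy input mirrors the Figure 1 example: only the third segment's label is flipped from 0 to 1, and the remaining labels are left unchanged.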