Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Yongbiao Gao, Xiangcheng Sun, Guohua Lv, Deng Yu, Sijiu Niu

TL;DR

This work tackles weakly-supervised audio-visual video parsing (AVVP) by introducing Reinforcement Learning-based Label Denoising (RLLD), a unified framework that simultaneously learns label denoising and AVVP. The approach uses a label-denoising network guided by a task network (HAN-based MMIL) and trains with a policy-gradient objective, where rewards come from both validation performance and a soft inter-reward that aligns denoising with parsing accuracy. By integrating denoising and parsing, RLLD directly optimizes for AVVP outcomes rather than treating denoising as a separate preprocessing step. Empirical results on LLP show state-of-the-art gains over existing label-denoising methods and improvements when plugged into other AVVP models, with ablations underscoring the importance of initialized labels and the soft inter-reward in achieving robust performance.

Abstract

Audio-visual video parsing (AVVP) aims to recognize audio and visual event labels with precise temporal boundaries. This is challenging because only video-level labels are available, while the audio or visual modality alone may contain only one of the labeled events. Existing label denoising models often treat denoising as a separate preprocessing step, which disconnects label denoising from the AVVP task. To bridge this gap, we present a novel joint reinforcement learning-based label denoising approach (RLLD). This approach trains the label denoising and video parsing models simultaneously through a joint optimization strategy. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of the label denoising policy. Extensive experiments on AVVP tasks demonstrate the superior performance of our proposed method compared to existing label denoising techniques. Furthermore, incorporating our label denoising method into other AVVP models further improves their parsing results.
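
To make the policy-gradient idea summarized above more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes a per-class Bernoulli "keep or drop" policy for one modality, a REINFORCE-style update with a moving-average baseline, and a hypothetical stand-in reward (`toy_reward`) that simply measures agreement with hidden modality labels. In RLLD the reward instead comes from AVVP validation performance and the soft inter-reward, neither of which is modeled here. All function and parameter names are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_reward(denoised, hidden_truth):
    """Hypothetical stand-in reward: fraction of classes labeled correctly.

    The paper's reward is derived from the parsing model, not from ground truth.
    """
    matches = sum(int(d == t) for d, t in zip(denoised, hidden_truth))
    return matches / len(hidden_truth)


def train_denoising_policy(weak_labels, hidden_truth, steps=2000, lr=0.5, seed=0):
    """Learn keep-probabilities for the video-level labels of one modality."""
    rng = random.Random(seed)
    logits = [0.0] * len(weak_labels)  # one policy logit per class
    baseline = 0.0                     # moving-average reward baseline

    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sample keep (1) or drop (0) for each class present at video level;
        # classes absent from the video-level labels always stay 0.
        actions = [int(rng.random() < p) if w else 0
                   for p, w in zip(probs, weak_labels)]
        reward = toy_reward(actions, hidden_truth)
        advantage = reward - baseline
        # REINFORCE: for a Bernoulli policy with a sigmoid logit,
        # d log pi(a) / d logit = a - p.
        for k, w in enumerate(weak_labels):
            if w:
                logits[k] += lr * advantage * (actions[k] - probs[k])
        baseline = 0.9 * baseline + 0.1 * reward

    return [sigmoid(z) if w else 0.0 for z, w in zip(logits, weak_labels)]


if __name__ == "__main__":
    # Video-level labels say classes 0, 1, 2 occur; in this toy modality only
    # classes 0 and 2 actually do, so class 1 is label noise for this modality.
    keep_probs = train_denoising_policy([1, 1, 1, 0], [1, 0, 1, 0])
    print([round(p, 3) for p in keep_probs])
```

In this toy setting the policy should learn to keep the labels that are truly present in the modality and to drop the noisy one. The point of the sketch is only that the denoising decisions are trained from a scalar reward rather than from direct supervision.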

Paper Structure

This paper contains 16 sections, 15 equations, 11 figures, 3 tables, and 1 algorithm.

Figures (11)

  • Figure 1: The overall framework of our proposed joint-training RLLD for AVVP. The label denoising module generates the denoising policy, and the task module feeds back the joint reward to guide the learning of the label denoising module.
  • Figure 2: Segment-level results of VALOR and RLLD+VALOR.
  • Figure 3: Segment-level results of CPSP and RLLD+CPSP.
  • Figure 4: Event-level results of VALOR and RLLD+VALOR.
  • Figure 5: Event-level results of CPSP and RLLD+CPSP.
  • ...and 6 more figures