Audio-Visual Segmentation via Unlabeled Frame Exploitation

Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang

TL;DR

This work divides the unlabeled frames for audio-visual segmentation into two categories based on their temporal characteristics, and proposes a versatile framework that effectively leverages them to tackle AVS.

Abstract

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been made, we experimentally reveal that current methods gain only marginally from the unlabeled frames, leaving them underutilized. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frames (NFs) and distant frames (DFs). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. In contrast, DFs are temporally far from the labeled frame but share semantically similar objects with appearance variations. Considering these distinct characteristics, we propose a versatile framework that effectively leverages both to tackle AVS. Specifically, for NFs, we exploit motion cues as dynamic guidance to improve objectness localization. For DFs, we exploit semantic cues by treating them as valid augmentations of the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
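
The NF/DF division described in the abstract can be illustrated with a minimal sketch. The abstract only says that NFs are temporally adjacent to the labeled frame and DFs are temporally far from it; the function name `split_unlabeled_frames` and the `nf_radius` cutoff below are hypothetical choices for illustration, not the authors' rule.

```python
def split_unlabeled_frames(num_frames, labeled_idx, nf_radius=1):
    """Partition unlabeled frame indices into neighboring (NF) and distant (DF) frames.

    nf_radius is an assumed cutoff: frames within this temporal distance of
    the labeled frame count as NFs, all other unlabeled frames as DFs.
    """
    neighboring, distant = [], []
    for t in range(num_frames):
        if t == labeled_idx:
            continue  # the labeled frame belongs to neither group
        if abs(t - labeled_idx) <= nf_radius:
            neighboring.append(t)
        else:
            distant.append(t)
    return neighboring, distant


# A 5-frame clip whose first frame is the labeled one.
nf, df = split_unlabeled_frames(num_frames=5, labeled_idx=0)
print(nf, df)  # [1] [2, 3, 4]
```

In this sketch the NFs would feed motion estimation against the labeled frame, while the DFs would be passed to self-training as additional samples.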

Paper Structure (13 sections, 7 equations, 3 figures, 6 tables, 1 algorithm)

Figures (3)

  • Figure 1: Comparison between previous methods and ours on how the unlabeled frames are harnessed. (a) Previous methods perform global temporal modeling (GTM) over all frames of a sequence, labeled and unlabeled alike, without explicitly exploiting the unlabeled frames. (b) Our method employs two types of unlabeled frames: (i) the neighboring frames (NFs) provide motion cues for accurately segmenting the sounding object; (ii) the distant frames (DFs) contain semantic cues for enhancing data diversity. (c) With TPAVI as the base method, previous methods using global temporal modeling (w/ GTM) show only marginal gains over the model trained only on labeled frames (w/o GTM), whereas our method achieves significant improvement with the unlabeled frames.
  • Figure 2: Overview of our framework to exploit unlabeled frames. (a) Teacher-student network for training. The student network is optimized with $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$. $\mathcal{L}_{sup}$ is computed between the predicted mask $f_\theta(x^l)$ and the ground truth for the labeled frame $x^l$; $\mathcal{L}_{unsup}$ is computed between the student's prediction $f_\theta(x^u_s)$ for the strongly augmented unlabeled frame and the pseudo mask $f_\theta(x^u_w)$ predicted by the teacher, which serves as its target. (b) Inference pipeline of the framework. We incorporate optical flow as an auxiliary input to exploit the motion cues within NFs.
  • Figure 3: Qualitative comparison between our method and AVSegFormer (Gao et al., 2023) on both subsets of AVSBench. Our method segments better: it localizes the exact sounding object, attends to fine-grained details, and stays closer to the ground truth.
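
The training objective outlined in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not specify the loss form, so binary cross-entropy, the 0.5 binarization threshold in `pseudo_mask`, and the `unsup_weight` balancing factor are all hypothetical, and masks are represented as flat lists of per-pixel probabilities to keep the example dependency-free.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between per-pixel probabilities and targets."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def pseudo_mask(teacher_pred, threshold=0.5):
    """Binarize the teacher's prediction f(x^u_w) into a pseudo mask (assumed rule)."""
    return [1.0 if p >= threshold else 0.0 for p in teacher_pred]


def total_loss(student_labeled, gt, student_unlabeled, teacher_unlabeled,
               unsup_weight=1.0):
    """L_sup on the labeled frame plus a weighted L_unsup on the unlabeled frame."""
    l_sup = bce(student_labeled, gt)  # f(x^l) vs. ground truth
    l_unsup = bce(student_unlabeled, pseudo_mask(teacher_unlabeled))  # f(x^u_s) vs. pseudo mask
    return l_sup + unsup_weight * l_unsup


# Two-pixel toy masks: the student is fairly confident on both frames.
loss = total_loss(
    student_labeled=[0.9, 0.1], gt=[1.0, 0.0],
    student_unlabeled=[0.8, 0.2], teacher_unlabeled=[0.7, 0.3],
)
print(round(loss, 4))
```

Only the student would receive gradients from this loss; how the teacher is updated is not stated in the caption, so it is left out of the sketch.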