Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu

TL;DR

This work tackles AVSS by identifying the optimization clash between audio-visual alignment and semantic understanding in end-to-end training. It proposes Stepping Stones, a two-stage strategy that first learns AVS localization with binary labels and then semantic AVSS using the stage-1 results as stepping stones, complemented by Robust Audio-aware Keys to mitigate stage-1 errors. It also introduces Adaptive Audio Visual Segmentation (AAVS), a transformer-based framework with an adaptive audio query generator and masked attention to dynamically fuse audio and visual information. Across AVSBench benchmarks, AAVS with Stepping Stones achieves state-of-the-art performance, especially on the AVSS task, demonstrating strong generalization and improved convergence. The findings underscore the value of staged learning for complex multimodal tasks and offer a practical pathway to robust audio-visual semantic segmentation in real-world videos.

Abstract

Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, because the AVSS task requires establishing audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods struggle to handle this mix of objectives in end-to-end training, resulting in insufficient learning and suboptimal results. Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simpler subtasks, from localization to semantic understanding. Each subtask is fully optimized in its own stage to achieve step-by-step global optimization. This training strategy also generalizes to existing methods and improves them. To further improve performance on AVS tasks, we propose a novel framework, Adaptive Audio Visual Segmentation (AAVS), in which we incorporate an adaptive audio query generator and integrate masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The project homepage can be accessed at https://gewu-lab.github.io/stepping_stones/.
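
The first stage described above learns localization from binary labels. As a minimal sketch of how such labels could be derived from semantic annotations, the snippet below collapses a per-pixel class-id mask into a sounding/non-sounding mask. The function name `semantic_to_binary` and the convention that class id 0 means background are assumptions made for illustration, not details taken from the paper.

```python
from typing import List

# Assumption: class id 0 marks pixels with no sounding object.
BACKGROUND_ID = 0


def semantic_to_binary(mask: List[List[int]]) -> List[List[int]]:
    """Collapse a per-pixel class-id mask into a 0/1 sounding-object mask."""
    return [[0 if class_id == BACKGROUND_ID else 1 for class_id in row]
            for row in mask]


# Toy 2x3 semantic mask with two hypothetical foreground classes (3 and 7).
semantic_mask = [
    [0, 0, 3],
    [0, 7, 3],
]
binary_mask = semantic_to_binary(semantic_mask)
```

Under this assumption, a stage-1 model would be supervised with `binary_mask`, and the full `semantic_mask` would only be needed in stage 2.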

Paper Structure (29 sections, 4 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: Comparison between previous methods and our Stepping Stones training strategy. Top: Previous end-to-end AVSS approaches result in sub-optimization on audio-visual alignment. Specifically, when trained under the AVSS-setting, these methods exhibit weaker sound source localization capability compared to those trained under the AVS-setting. Bottom: our Stepping Stones training strategy decomposes the intricate AVSS task into two relatively simple subtasks to be fully learned in two stages, enhancing the performance significantly.
  • Figure 2: Overview of AAVS framework. (1) Visual and audio features are extracted by the pre-trained encoder; (2) Adaptive Audio Query Generator is proposed to generate audio queries; (3) In the transformer decoder, audio-aware queries are integrated with visual feature maps, and masked cross-attention facilitates queries to dynamically adjust the attention range; (4) Finally, refined queries are merged with the mask feature to obtain the final prediction mask. Red arrows indicate newly introduced methods when implementing the Stepping Stones strategy.
  • Figure 3: Qualitative comparison with previous methods. Left: S4 subtask; center: MS3 subtask; right: AVSS subtask.
  • Figure 4: Previous methods learn insufficiently under the AVSS setting. The last column shows the model trained with the Stepping Stones strategy. The model learns inadequately when trained end-to-end in the AVSS setting, whereas the localization accuracy of its predicted semantic mask improves notably once Stepping Stones is applied.
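
The Figure 2 caption mentions masked cross-attention that lets queries adjust their attention range. The following is a minimal sketch of the general idea, assuming the common formulation in which a query's softmax is restricted to the key positions a mask marks as foreground. The function name `masked_attention_weights`, the boolean `keep` mask, and the fallback to full attention when the mask is empty are assumptions for illustration, not the paper's exact design.

```python
import math
from typing import List


def masked_attention_weights(scores: List[float], keep: List[bool]) -> List[float]:
    """Softmax over `scores`, restricted to positions where `keep` is True."""
    if not any(keep):
        # Assumption: with an empty mask, fall back to attending everywhere.
        keep = [True] * len(scores)
    # Subtract the largest kept score for numerical stability.
    top = max(s for s, k in zip(scores, keep) if k)
    exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]


# One query scoring four key positions; only the first and last are foreground.
weights = masked_attention_weights([2.0, 1.0, 0.5, 2.0], [True, False, False, True])
```

Masked-out positions receive zero weight, so the query aggregates features only from the region the mask selects.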