Unveiling and Mitigating Bias in Audio Visual Segmentation

Peiwen Sun, Honggang Zhang, Di Hu

TL;DR

The paper tackles bias in audio-visual segmentation (AVS) by identifying two phenomena that degrade grounding quality: audio priming bias and visual prior. It proposes semantic-aware active queries to improve the perception of audio semantics, and a soft, uncertainty-based debiasing strategy that mitigates visual prior without altering the model structure. Experiments on AVS benchmarks, including a synthetic Co-AVS subset, show competitive results and more robust behavior under bias, and ablations validate the audio-bias and visual-bias components both separately and in combination. Overall, the work offers a practical toolkit for improving the reliability of multimodal grounding in AVS.

Abstract

Researchers have developed a range of advanced audio-visual segmentation (AVS) models aimed at improving the quality of sounding objects' masks. While the masks produced by these models may appear plausible at first glance, they occasionally exhibit anomalies that reflect incorrect grounding logic. We attribute this to real-world inherent preferences and distributions, which offer a simpler learning signal than complex audio-visual grounding and lead the model to disregard important modality information. These anomalous phenomena are often complex and hard to observe systematically. In this study, we make a pioneering effort, using suitable synthetic data, to categorize and analyze the phenomena as two types, "audio priming bias" and "visual prior", according to the source of the anomalies. For audio priming bias, to enhance sensitivity to different audio intensities and semantics, a dedicated audio perception module perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism for these active queries in the transformer decoder is customized to regulate the interaction among audio semantics. For visual prior, multiple contrastive training strategies are explored that optimize the model by incorporating a biased branch, without changing the structure of the model. Our experiments demonstrate the presence and impact of these biases in existing models. Finally, through experimental evaluation on AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of bias, achieving competitive performance across all three subsets.
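The abstract does not spell out the contrastive training strategies. As a rough illustration of the general idea of training against a biased branch, the sketch below uses a loss-reweighting scheme that is common in the debiasing literature; it is an assumption for illustration, not the authors' formulation. Pixels that a hypothetical vision-only "biased branch" already predicts correctly are down-weighted, so the audio-visual model is pushed to learn from pixels where the visual shortcut fails. The function names and the flat per-pixel logit lists are invented for the example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one pixel with probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def reweighted_mask_loss(av_logits, biased_logits, targets):
    """Mean per-pixel BCE on the audio-visual logits.

    Each pixel is weighted by (1 - probability the biased branch assigns
    to the correct label), so pixels the shortcut already solves count less.
    All three arguments are flat lists of equal length (hypothetical layout).
    """
    total = 0.0
    for z_av, z_b, y in zip(av_logits, biased_logits, targets):
        p_b = sigmoid(z_b)
        p_b_correct = p_b if y == 1 else 1.0 - p_b
        weight = 1.0 - p_b_correct
        total += weight * bce(sigmoid(z_av), y)
    return total / len(targets)


# Same audio-visual prediction, but the biased branch is right in one case
# and wrong in the other: the second pixel dominates the loss.
loss_easy = reweighted_mask_loss([0.0], [4.0], [1])
loss_hard = reweighted_mask_loss([0.0], [-4.0], [1])
```

In this toy comparison the pixel the biased branch gets wrong receives a much larger weight, which is the intended effect of contrasting against a biased branch. A real implementation would operate on tensors and typically stop gradients through the biased branch.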

Paper Structure

This paper contains 19 sections, 12 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: The illogical anomalies caused by biases. (a) Even if each audio yields a satisfactory mask separately, one of them still dominates when they overlap. (b) In the training data, the piano is usually sounding. During testing, regardless of the sound presented, the model still tends to prioritize the piano. These two kinds of illogical anomaly can be categorized as “audio priming bias” and “visual prior”, which are typically observed together as a general impediment in AVS models.
  • Figure 2: Illustration of audio priming bias. Higher audio intensity results in stronger guiding capability (green block). Audio with distinct semantic attributes is also easier to learn and has stronger guiding capability (grey block); for instance, music has greater guiding capability than human sound. Consequently, these differing guiding capabilities cause the dominance phenomenon (red block).
  • Figure 3: In conventional semantic segmentation, the learnable queries are solely responsible for generating regions. In an ideal AVS model with a similar structure, however, the learnable queries need not only to generate regions but also to perceive and regulate the interaction between audio semantics. This mechanism is therefore designed to enhance the understanding of latent semantics and to restrict the interaction to semantic-aware active queries.
  • Figure 4: Illustration of visual prior. (a) The model tends to learn statistically plausible results rather than the harder, desired grounding behavior. Note: in the bar chart, blue represents the proportion of cases where the object is present in the image and emitting sound, while orange represents the proportion where the object is present but not emitting sound. As an example, since the piano generally appears with a high sounding probability in the training data, the model segments the piano as soon as it sees one. (b) To deal with visual prior, we introduce debiasing strategies based on contrasting the audio-visual model with the biased branch. (c) The ideal result is a mask free of visual prior.
  • Figure 5: (a) Performance comparison of different methods on V1M under certain intensity conditions. Our method brings more sensitivity in low-intensity scenarios. (b) Performance comparison of different methods on our Co-AVS subsets. Our method has advantages in both performance and robustness.
  • ...and 1 more figures
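
The bar chart described for Figure 4(a) compares, per object category, how often an object is visible and sounding versus visible but silent. A minimal sketch of how such a statistic could be computed is given below. The annotation layout (one pair of visible and sounding category sets per frame) and the toy data are assumptions made for illustration, not the dataset's actual format or numbers.

```python
from collections import defaultdict


def sounding_rates(frames):
    """Return, per category, the fraction of frames in which it is sounding
    among the frames in which it is visible.

    frames: iterable of (visible_categories, sounding_categories) pairs,
    each a set of category names (hypothetical annotation layout).
    """
    visible = defaultdict(int)
    sounding = defaultdict(int)
    for visible_cats, sounding_cats in frames:
        for cat in set(visible_cats):
            visible[cat] += 1
            if cat in sounding_cats:
                sounding[cat] += 1
    return {cat: sounding[cat] / visible[cat] for cat in visible}


# Toy annotations, invented for the example.
toy_frames = [
    ({"piano", "person"}, {"piano"}),
    ({"piano"}, {"piano"}),
    ({"piano", "person"}, {"piano", "person"}),
    ({"person", "guitar"}, {"guitar"}),
]
rates = sounding_rates(toy_frames)
```

In this toy set the piano is sounding in every frame where it is visible, while the person is sounding in only one of three. A category with a rate near 1.0 is one for which a model could score well by segmenting the object whenever it appears, which is the shortcut the caption calls visual prior.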