Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

Jinxing Zhou; Yanghao Zhou; Yaoting Wang; Zongyan Han; Jiaqi Ma; Henghui Ding; Rao Muhammad Anwer; Hisham Cholakkal

TL;DR

This paper proposes MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. It can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement.

Abstract

Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and a candidate segmentation mask, the task requires estimating the mask's IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and code will be released at https://github.com/jasongief/MQA-RefAVS.

Paper Structure (23 sections, 13 equations, 13 figures, 11 tables)

Introduction
Task: MQA-RefAVS
Dataset: MQ-RAVSBench
Data Source and Split
Mask Taxonomy and Quality Annotation
Training and Evaluation Protocols
Method: MQ-Auditor
Experiments
Main Results
Ablation Studies
Segmentation Improvement via MQ-Auditor
Conclusion
Related Work
More Dataset Statistics
Calculation Details of Evaluation Metrics
...and 8 more sections

Figures (13)

Figure 1: Task illustration. Prior Ref-AVS methods aim to segment the target object. In contrast, our proposed MQA-RefAVS task focuses on automatic mask quality assessment, which makes it possible to identify mask errors and recommend suitable actions for further refinement.
Figure 2: Mask construction pipeline of MQ-RAVSBench. For training and image-based evaluation, we employ an object detection model, Detic [zhou2022detecting], to identify a key frame containing the richest set of objects. Starting from the ground-truth Perfect masks of Ref-AVSBench [wang2024ref], we use the OpenCV library to generate masks with geometric quality issues, including Cutout, Dilate, and Erode. In addition, we construct a pipeline using powerful MLLMs/VLMs to generate Full_neg masks, which cover entirely incorrect objects and exhibit severe semantic quality issues. Combining Full_neg and Perfect masks yields the Merge masks.
Figure 3: Illustration of our MQ-Auditor model.
Figure 4: IoU distribution of MQ-RAVSBench. For the test set, IoU statistics are computed based on samples used in the image-based evaluation. The IoU values for the Perfect and Full_neg types are always 1 and 0, respectively. The Cutout/Dilate/Erode masks typically exhibit higher IoU values around 0.8; we intentionally control this range to avoid overly obvious quality errors that would trivialize assessment. The IoU values of Merge masks span the full range from 0 to 1, depending on the relative area between the ground-truth object and the merged negative regions. For example, when the ground-truth object is small and the merged negative objects are large, the resulting Merge mask yields a low IoU; otherwise, a higher IoU is obtained.
Figure 5: Qualitative comparison of different mask quality assessment approaches. Mask type: Perfect.
...and 8 more figures
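
The IoU behaviour described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' code: the helper name `mask_iou`, the 100x100 canvas, the rectangle positions, and the empty-union convention are all assumptions chosen for illustration, and the negative region is assumed not to overlap the ground-truth object. It only shows why a Merge mask (ground truth plus a negative region) scores low when the negative region is large relative to the target, and higher when it is small.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary masks (hypothetical helper)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


H, W = 100, 100  # illustrative canvas size

# Ground-truth ("Perfect") mask: a 10x10 square, 100 pixels.
gt = np.zeros((H, W), dtype=bool)
gt[10:20, 10:20] = True

# Two disjoint negative regions of different size.
neg_large = np.zeros((H, W), dtype=bool)
neg_large[50:90, 50:90] = True  # 1600 pixels
neg_small = np.zeros((H, W), dtype=bool)
neg_small[50:55, 50:55] = True  # 25 pixels

iou_perfect = mask_iou(gt, gt)                   # mask equals ground truth
iou_full_neg = mask_iou(neg_large, gt)           # entirely wrong object
iou_merge_large = mask_iou(gt | neg_large, gt)   # 100 / 1700
iou_merge_small = mask_iou(gt | neg_small, gt)   # 100 / 125

print(iou_perfect, iou_full_neg, iou_merge_large, iou_merge_small)
```

With a disjoint negative region, the Merge IoU reduces to the ground-truth area divided by the combined area, which is why the Merge type can span the whole range from 0 to 1.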
