MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

Yuan Zhao, Zhenqi Jia, Yongqiang Zhang

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, we propose MAR3, a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework for high-quality Reference Audio-Visual Segmentation. Drawing on the Delphi method from sociology for robust analysis, we propose a Consensus Multimodal Recognition mechanism that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% J&F) on the Ref-AVSBench dataset, outperforming the previous state of the art by an absolute 3.4%.
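
As a concrete illustration of the Consensus Multimodal Recognition step, the minimal Python sketch below shows one way a Delphi-style consensus among LLM agents could work: each agent answers independently, the anonymized vote distribution is fed back for reconsideration, and the loop stops at unanimity or falls back to a majority vote. The `ask_agents` callable and the label sets (easy/hard difficulty, audio/visual/text dominance) are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from typing import Callable

# One agent reply: (difficulty, dominant_modality), e.g. ("hard", "audio").
AgentReply = tuple[str, str]

def delphi_consensus(
    ask_agents: Callable[[str], list[AgentReply]],  # fan one prompt out to all agents
    expression: str,
    max_rounds: int = 3,
) -> AgentReply:
    """Delphi-style rounds: share the anonymized vote distribution back to
    the agents until they agree, or fall back to the majority answer."""
    prompt = (
        "Classify this reference expression for Ref-AVS.\n"
        f"Expression: {expression}\n"
        "Answer with difficulty (easy/hard) and dominant modality "
        "(audio/visual/text)."
    )
    replies = ask_agents(prompt)
    for _ in range(max_rounds):
        votes = Counter(replies)
        top, count = votes.most_common(1)[0]
        if count == len(replies):  # unanimous: consensus reached
            return top
        # Anonymous-feedback step of the Delphi method: agents see the
        # group distribution and may revise their answers.
        replies = ask_agents(prompt + f"\nGroup votes so far: {dict(votes)}.")
    return Counter(replies).most_common(1)[0][0]  # no consensus: majority wins
```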

Paper Structure

This paper contains 15 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: (a) Illustration of the Reference Audio-Visual Segmentation (Ref-AVS) task. (b) Our proposed MAR3 first explicitly recognizes the difficulty of reference expressions and the dominant modality of multimodal cues, then performs referred object reasoning based on the modality-dominant difficulty rule, and finally generates the precise prediction mask through iterative optimization of the referred object prompt based on intermediate segmentation results.
  • Figure 2: The architecture of our MAR3, which contains three mechanisms: Consensus Multimodal Recognition (CMR), Collaborative Object Reasoning (COR), and Reflective Learning Segmentation (RLS). CMR explicitly recognizes the difficulty and dominant modality in the referring expression and multimodal cues using the sociological Delphi theory. COR is designed to reliably reason about the referred object through collaboration between dominant- and auxiliary-modality agents. RLS ensures precise mask prediction where a check agent iteratively corrects the object text prompts for the segment agent based on intermediate segmentation results (a minimal sketch of this loop appears after this list).
  • Figure 3: Visualization results of our MAR3 compared with the second-best method, TGS-Agent. Additional qualitative visualizations and comparative examples can be found in Appendix B of the supplemental material.
  • Figure 4: Difficulty proportions of reference expressions on the Ref-AVSBench test set, identified by our Consensus Multimodal Recognition (CMR) mechanism based on the modality-dominant difficulty rule.
  • Figure 5: Dominant modality proportions of reference expressions on the Ref-AVSBench test set, identified by our Consensus Multimodal Recognition (CMR) mechanism based on the modality-dominant difficulty rule.
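
To make the Reflective Learning Segmentation loop described in the Figure 2 caption concrete, here is a hedged Python sketch of the reflect-and-correct cycle: a segment agent produces a mask from an object text prompt, and a check agent either accepts the intermediate result or returns a corrected prompt for the next attempt. The `segment` and `check` callables stand in for a text-promptable segmenter and a multimodal LLM; all names are assumptions, not the paper's released code.

```python
import numpy as np
from typing import Callable

def reflective_segmentation(
    frame: np.ndarray,                  # one video frame
    object_prompt: str,                 # initial referred-object text prompt
    segment: Callable[[np.ndarray, str], np.ndarray],  # (frame, prompt) -> mask
    check: Callable[[np.ndarray, np.ndarray, str], tuple[bool, str]],
    max_iters: int = 3,
) -> tuple[np.ndarray, str]:
    """Iteratively refine the object text prompt until the check agent
    accepts the intermediate mask or the iteration budget is spent."""
    mask = segment(frame, object_prompt)
    for _ in range(max_iters):
        ok, revised_prompt = check(frame, mask, object_prompt)
        if ok:                           # check agent validates the mask
            break
        object_prompt = revised_prompt   # correct the prompt and re-segment
        mask = segment(frame, object_prompt)
    return mask, object_prompt
```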