
$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

Xiang Li, Kai Qiu, Jinglu Wang, Xiaohao Xu, Rita Singh, Kashu Yamazaki, Hao Chen, Xiaonan Huang, Bhiksha Raj

TL;DR

$\text{R}^2$-Bench addresses the lack of robustness evaluation in referring perception by introducing a taxonomy of perturbations, a toolbox for synthesizing composite noise, and a benchmark spanning five tasks. It systematically studies how perturbations affect referring perception models (RPMs) and reveals their vulnerabilities, with methods such as PolyFormer and SEEM showing varying robustness across modalities. The paper also proposes $\text{R}^2$-Agent, an LLM-based evaluation assistant that automates data proposal, verification, and model analysis via a multi-agent debate mechanism. The work provides datasets, methodology, and insights aimed at safer, more resilient referring perception in real-world scenarios.

Abstract

Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive. Despite progress in this field, the robustness of referring perception models (RPMs) against disruptive perturbations is not well explored. This work thoroughly assesses the resilience of RPMs against various perturbations in both general and specific contexts. Recognizing the complex nature of referring perception tasks, we present a comprehensive taxonomy of perturbations, and then develop a versatile toolbox for synthesizing and evaluating the effects of composite disturbances. Employing this toolbox, we construct $\text{R}^2$-Bench, a benchmark for assessing the Robustness of Referring perception models under noisy conditions across five key tasks. Moreover, we propose the $\text{R}^2$-Agent, an LLM-based agent that simplifies and automates model evaluation via natural language instructions. Our investigation uncovers the vulnerabilities of current RPMs to various perturbations and provides tools for assessing model robustness, potentially promoting the safe and resilient integration of intelligent systems into complex real-world scenarios.

Paper Structure

This paper contains 37 sections, 5 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Motivation illustration. Referring perception models (RPMs) empower intelligent systems with their ability to perform object grounding within the environment based on referring guidance, such as textual descriptions, imagery exemplars, or auditory signals associated with the target object. However, RPMs' performance can be compromised by disturbances in real-world scenarios, such as environmental noise (e.g., extraneous sounds from a nearby radio), human-induced errors (e.g., typographical errors in textual input), and limitations in the sensor (e.g., motion blur in images). Conducting a rigorous analysis of RPMs' robustness to a wide array of perturbations is necessary for building reliable real-world applications.
  • Figure 2: Examples of $\text{R}^2$-Bench. The "Original" row displays the original inputs alongside the outputs of models for R-VOS [wu2023onlinerefer], AVS [li2024towards], and Q3M [jatavallabhula2023conceptfusion], while the "Perturbed" row presents the inputs synthesized by $\text{R}^2$-Bench and the outputs of the same models. RG: short for Referring Guidance.
  • Figure 3: Noise categories based on their origins. Assuming the airplane is the source of the referring guidance, noise originating from it is categorized as source noise.
  • Figure 4: Overview of $\text{R}^2$-Agent, the automatic evaluation assistant. Given a human instruction, clean datasets, perturbation functions, and evaluation functions, $\text{R}^2$-Agent first proposes and verifies perturbed test samples that match the given instruction. After that, $\text{R}^2$-Agent evaluates the model using the verified samples and provides a report that articulates the model's vulnerabilities and overall resilience.
  • Figure 5: Chain-of-thought prompting template for data verification. ①&②: In the spirit of chain-of-thought prompting, we first give the LLM examples to strengthen its in-context learning. We then ask the LLM to answer a question similar to the given example. Specifically, for the data verification task, we ask the LLM to verify the samples selected in the previous iteration, update the results, and explain its reasoning. The LLM is instructed to respond with a Python-format list of dictionaries. ③: The response from the LLM, which follows the requested Python-format list.
  • ...and 6 more figures
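
The disturbances named in the Figure 1 caption (typographical errors in text, extraneous sound, motion blur in images) can be illustrated as small functions applied together to one sample. The following is a minimal sketch under stated assumptions, not the paper's toolbox: the function names, the sample layout (`text`, `audio`, `image` keys), and the parameter values are all hypothetical, and a horizontal box filter stands in for real motion blur.

```python
import random

def typo_swap(text, rate, rng):
    """Swap adjacent letters with probability `rate` (a stand-in for typing errors)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each letter moves at most once
        else:
            i += 1
    return "".join(chars)

def add_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise to a 1-D audio signal (list of floats)."""
    return [s + rng.gauss(0.0, sigma) for s in signal]

def motion_blur(image, kernel):
    """Horizontal box blur over a 2-D grayscale image (list of rows).

    `kernel` is assumed to be odd; windows are truncated at the image border.
    """
    half = kernel // 2
    out = []
    for row in image:
        blurred = []
        for x in range(len(row)):
            window = row[max(0, x - half): x + half + 1]
            blurred.append(sum(window) / len(window))
        out.append(blurred)
    return out

def perturb_sample(sample, seed=0):
    """Apply one perturbation per modality (hypothetical composite disturbance)."""
    rng = random.Random(seed)  # seeded so perturbed test sets are reproducible
    return {
        "text": typo_swap(sample["text"], rate=0.2, rng=rng),
        "audio": add_noise(sample["audio"], sigma=0.05, rng=rng),
        "image": motion_blur(sample["image"], kernel=3),
    }

if __name__ == "__main__":
    clean = {
        "text": "the plane on the left",
        "audio": [0.0] * 8,
        "image": [[0, 0, 9, 0, 0]],
    }
    print(perturb_sample(clean, seed=1))
```

Seeding the random generator per sample keeps the perturbed inputs identical across runs, which matters when several models are compared on the same noisy data.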