GREC: Generalized Referring Expression Comprehension

Shuting He, Henghui Ding, Chang Liu, Xudong Jiang

TL;DR

This work exposes the limitations of Classic Referring Expression Comprehension (REC) in handling no-target and multi-target expressions. It introduces Generalized Referring Expression Comprehension (GREC) and the gRefCOCO dataset, along with new evaluation metrics that account for multiple bounding boxes and no-target cases. Empirical results show existing REC methods underperform on GREC, and a threshold-based, dynamic box-selection strategy offers the most effective grounding performance. The contributions provide a more realistic, versatile grounding framework with practical implications for multi-object grounding and image retrieval tasks. The work also provides benchmark resources and baseline implementations to accelerate future research.
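The difference between a fixed-size output and the threshold-based, dynamic box selection mentioned above can be illustrated with a minimal sketch. It assumes a detector has already produced candidate boxes with confidence scores. The function names, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are illustrative assumptions, not values or code taken from the paper.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2)
Box = Tuple[float, float, float, float]


def select_boxes_by_threshold(
    boxes: List[Box], scores: List[float], threshold: float = 0.5
) -> List[Box]:
    """Keep every box whose score reaches the threshold.

    The output size is dynamic: it may be empty (a no-target prediction),
    a single box, or several boxes (a multi-target prediction).
    The default threshold is a hypothetical value.
    """
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    return [box for box, score in zip(boxes, scores) if score >= threshold]


def select_top_k(boxes: List[Box], scores: List[float], k: int = 1) -> List[Box]:
    """Fixed-size selection in the style of classic REC: always return the
    k highest-scoring boxes, so it cannot express 'no target'."""
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:k]]
```

With `k = 1`, `select_top_k` always returns exactly one box, which matches the single-target assumption of classic REC. `select_boxes_by_threshold` returns an empty list when no candidate is confident enough, which is one way to represent a no-target expression.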

Abstract

The objective of Classic Referring Expression Comprehension (REC) is to produce a bounding box corresponding to the object mentioned in a given textual description. Existing datasets and techniques in classic REC are commonly tailored to expressions that pertain to a single target, meaning that each expression is linked to one specific object. Expressions that refer to multiple targets or to no specific target have not been taken into account. This constraint hinders the practical applicability of REC. This study introduces a new benchmark termed Generalized Referring Expression Comprehension (GREC). This benchmark extends classic REC by permitting expressions to describe any number of target objects. To achieve this goal, we have built the first large-scale GREC dataset, named gRefCOCO. This dataset encompasses a range of expressions: those referring to multiple targets, those with no specific target, and single-target expressions. The design of GREC and gRefCOCO ensures smooth compatibility with classic REC. The proposed gRefCOCO dataset, an implementation of a GREC method, and the GREC evaluation code are available at https://github.com/henghuiding/gRefCOCO.

Paper Structure (8 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Classic Referring Expression Comprehension (REC) only supports expressions that indicate a single target object, e.g., (1). Compared with classic REC, the proposed Generalized Referring Expression Comprehension (GREC) supports expressions indicating an arbitrary number of target objects, for example, multi-target expressions like (2)-(5), and no-target expressions like (6).
  • Figure 2: More applications of GREC brought by supporting multi-target and no-target expressions compared to classic REC.
  • Figure 3: Example results of the modified MDETR on the gRefCOCO dataset. Red bounding boxes denote the ground truth, and green bounding boxes denote the predictions. The top row shows successful cases, while the following two rows show failure cases for multi-target and no-target scenarios, respectively.