Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S. -H. Gary Chan, Hongyang Zhang

TL;DR

The paper critiques existing referring expression comprehension benchmarks for underestimating modern large multimodal models due to labeling noise and overly short expressions. It introduces Ref-L4, a large-scale, diverse REC benchmark with long expressions and a vast vocabulary, generated via GPT-4V with human validation and expanded through rephrasing. The authors evaluate 24 LMMs on Ref-L4 and demonstrate that benchmark noise significantly affects reported performance, while cleaned versions of RefCOCO/RefCOCO+/RefCOCOg give a more accurate picture of model capabilities. They provide a comprehensive evaluation protocol, including Acc@k, mAcc, scale-aware analyses, per-category performance, and cross-source assessments, alongside open data and code for broader adoption.

Abstract

Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg capture LMMs' comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermines the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements, thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC models. Ref-L4 is distinguished by four key features: 1) a substantial sample size with 45,341 annotations; 2) a diverse range of object categories with 365 distinct types and varying instance scales from 30 to 3,767; 3) lengthy referring expressions averaging 24.2 words; and 4) an extensive vocabulary comprising 22,813 unique words. We evaluate a total of 24 large models on Ref-L4 and provide valuable insights. The cleaned versions of RefCOCO, RefCOCO+, and RefCOCOg, as well as our Ref-L4 benchmark and evaluation code, are available at https://github.com/JierunChen/Ref-L4.
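
To make the accuracy figures above concrete, here is a minimal sketch of how REC accuracy is commonly scored: a prediction counts as correct when the intersection-over-union (IoU) between the predicted and ground-truth boxes reaches a threshold k, and a mean accuracy averages that over several thresholds. This is not taken from the authors' released evaluation code. The function names (`iou`, `acc_at`, `mean_acc`), the `(x1, y1, x2, y2)` box format, the `>=` comparison, and the 0.50 to 0.95 threshold grid in steps of 0.05 are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], gts: Sequence[Box], k: float) -> float:
    """Fraction of predictions whose IoU with the ground truth is at least k."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(iou(p, g) >= k for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_acc(
    preds: Sequence[Box],
    gts: Sequence[Box],
    thresholds: Optional[Sequence[float]] = None,
) -> float:
    """Average of acc_at over a threshold grid (assumed: 0.50, 0.55, ..., 0.95)."""
    if thresholds is None:
        thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(acc_at(preds, gts, t) for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    # Hypothetical toy data: one exact match, one box with IoU exactly 0.5.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (0, 0, 10, 5)]
    print(acc_at(preds, gts, 0.5), acc_at(preds, gts, 0.75), mean_acc(preds, gts))
```

Under this scoring, a mislabeled ground-truth box turns a correct prediction into a miss, which is why excluding noisy instances can raise measured accuracy without any change to the model.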

Paper Structure (18 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Examples from our Ref-L4 benchmark. We offer a detailed referring expression for each target instance represented by a bounding box. Zoom in for better visualization.
  • Figure 2: Pipeline of generating a referring expression for a target instance.
  • Figure 3: Analysis of referring expression length, instance size, and category distribution.
  • Figure 4: The frequency of the 10 most frequently used words in each part-of-speech category, as parsed using the SpaCy library.
  • Figure 5: Category-wise performance of the four top-performing models on the val+test set, sorted in descending order based on their average per-category performance. The performance of all models can be found in the appendix section on category-wise performance.
  • ...and 7 more figures