FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension

Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang

TL;DR

This work presents FineCops-Ref, a new Referring Expression Comprehension (REC) dataset with controllable difficulty levels and hard negative samples, designed to probe fine-grained compositional reasoning and robust visual grounding. It combines scene-graph-derived positive data, LLM-driven rewrites, and diffusion-based edits to generate challenging negatives, and is used to evaluate both traditional specialist methods and multimodal LLMs. Key findings show a substantial gap in grounding performance as difficulty rises and negatives are added; fine-tuning on the proposed data yields notable gains and some generalization to RefCOCO/+/g. The dataset and generation pipeline are released to advance research on robust cross-modal grounding and compositional reasoning in MLLMs.

Abstract

Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features. First, it is designed with controllable levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image, an essential aspect often overlooked in existing datasets and approaches. Using this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at https://github.com/liujunzhuo/FineCops-Ref.

Paper Structure (29 sections, 1 equation, 8 figures, 14 tables)

Figures (8)

  • Figure 1: The data construction pipeline of FineCops-Ref. Given an image, we first generate paths based on its scene graph. Then, we fill the paths into templates and obtain the positive referring expressions through LLM rewriting. Meanwhile, we use an LLM to generate negative expressions and, based on these, employ a diffusion model to create finely edited negative images.
  • Figure 2: The relationship between Precision@1 (on positive samples) and Recall@1 (on positive and negative samples) for specialist models and MLLMs across different negative difficulty levels. Specialist models correlate strongly with easier negative samples (Negative level 1, PCC = 0.923), while MLLMs show a higher correlation with harder negatives (Negative level 2, PCC = 0.917), reflecting their differing focuses on compositional REC.
  • Figure 3: Positive expressions of different difficulty levels.
  • Figure 4: Positive expressions of different syntactic structure types.
  • Figure 5: Negative images generated by different methods.
  • ...and 3 more figures
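
The Figure 2 caption reports Pearson correlation coefficients (PCC) between Precision@1 and Recall@1. As a reminder of how such a coefficient is computed, here is a minimal sketch in plain Python. The function name `pearson_corr` and the sample values are hypothetical placeholders for illustration only; they are not taken from the paper, and the authors' own evaluation code may differ.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder per-model scores (NOT results from the paper).
    precision_at_1 = [0.40, 0.55, 0.60, 0.72]
    recall_at_1 = [0.30, 0.41, 0.52, 0.58]
    print(round(pearson_corr(precision_at_1, recall_at_1), 3))
```

A value close to 1 would mean that models ranking higher on Precision@1 also tend to rank higher on Recall@1 at that negative level, which is how the PCC values in the caption can be read.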