CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu

TL;DR

This work targets the weakness of Multimodal Large Language Models (MLLMs) in grounding complex referring expressions by introducing Chain-of-Thought Referring (CoTR). It formalizes a four-stage data curation pipeline to convert complex queries into ordered anchor-based reasoning sequences and pairs them with visual groundings, plus a Composite Referring Benchmark to stress compositionality. The authors present RefLM, a unified architecture that uses box-point prompts and the Segment Anything Model (SAM) for mask generation, guided by an adaptive weighted loss that emphasizes the final target grounding. Empirical results on RefCOCO-derived benchmarks and the proposed composite benchmark demonstrate consistent gains, validating the effectiveness of explicit, sequential reasoning in improving localization and segmentation under challenging referring expressions.
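As a rough illustration of what an "ordered anchor-based reasoning sequence" paired with visual groundings could look like as data, the sketch below stores each parsed noun with a hop level and a box, then orders the nouns so that the target comes last. The class name `GroundedNoun`, the box format, the convention that the first anchor has hop level 0, and the example annotation are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedNoun:
    noun: str       # noun phrase parsed from the referring expression
    hop_level: int  # assumed convention: 0 for the first anchor, highest for the target
    box: Box        # grounding for this noun


def reasoning_order(nouns: List[GroundedNoun]) -> List[GroundedNoun]:
    """Return nouns sorted by hop level, so anchors precede the target."""
    return sorted(nouns, key=lambda n: n.hop_level)


def max_hop_level(nouns: List[GroundedNoun]) -> int:
    """Largest hop level in one annotation, used here as a complexity score."""
    return max(n.hop_level for n in nouns)


# Made-up annotation for "the cup on the table next to the window".
example = [
    GroundedNoun("cup", 2, (210.0, 140.0, 260.0, 200.0)),
    GroundedNoun("window", 0, (20.0, 10.0, 180.0, 220.0)),
    GroundedNoun("table", 1, (150.0, 180.0, 420.0, 330.0)),
]

for step in reasoning_order(example):
    print(step.hop_level, step.noun, step.box)
print("max hop level:", max_hop_level(example))
```

Under these assumptions, the sorted sequence visits the window, then the table, then the cup, and the annotation has a maximum hop level of 2.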

Abstract

Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, and they serve as benchmarks for the capabilities of Multimodal Large Language Models (MLLMs). To improve performance on these tasks, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data format. Our approach systematically parses the textual structure of a query into sequential referring steps, each of which identifies relationships and ensures consistent reference alignment, thereby improving accuracy on complex queries. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with an increase of 2.5% or more over baseline models.

Paper Structure

This paper contains 33 sections, 3 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: An example showcasing the answer of RefLM with and without CoT referring. The underlined words in the answer are the anchors, and the highlighted words correspond to masks of the same color (for clarity, we visualize only the box).
  • Figure 2: Model performance comparison across varying complexity levels of referring texts in the FineCops-Ref positive subset. Complexity is defined by the maximum hop level from the anchor noun to the target. Performance clearly declines as complexity increases.
  • Figure 3: Data pipeline for generating CoT Referring training data and the Composite Referring Benchmark. Qwen3 first extracts anchors and the target and rewrites the CoT answer. The parsed nouns and hop levels are validated by DeepSeek-V3. Qwen2.5-VL then grounds each noun with a bounding box. GPT-4o is used to verify each box, where the target box must satisfy $\mathrm{IoU}_{\mathrm{GT}} > 0.7$. For the benchmark, the maximum hop level $L_{\max}$ quantifies the compositionality of each referring annotation.
  • Figure 4: Examples from our curated data. Each example shows the referring expression, its text anchors with hop levels (H.L.), and the corresponding noun groundings. Highlighted nouns correspond to the mask of the same color in the image.
  • Figure 5: (a) RefLM architecture. The model consists of a vision encoder (including a projector) and a language model (LLM). Visual prompts generated by the MLLM are fed into the Segment Anything Model (SAM) for mask generation. (b) An example of our model output, including the LLM output and the SAM output.
  • ...and 4 more figures
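
The Figure 3 caption states that a target box must satisfy $\mathrm{IoU}_{\mathrm{GT}} > 0.7$ to pass verification. A minimal sketch of that check follows, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form. The function names `box_iou` and `passes_target_check` are hypothetical, and the sketch covers only the IoU comparison, not the GPT-4o verification step the caption also mentions.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def passes_target_check(pred: Box, gt: Box, threshold: float = 0.7) -> bool:
    """True when the predicted target box exceeds the IoU threshold against ground truth."""
    return box_iou(pred, gt) > threshold


if __name__ == "__main__":
    gt_box = (0.0, 0.0, 10.0, 10.0)
    print(passes_target_check((0.0, 0.0, 10.0, 8.0), gt_box))   # IoU = 0.8
    print(passes_target_check((5.0, 0.0, 15.0, 10.0), gt_box))  # IoU = 1/3
```

The comparison is strict (`>`), matching the inequality in the caption, so a box with IoU exactly 0.7 would be rejected in this sketch.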