CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu
TL;DR
This work targets the weakness of Multimodal Large Language Models (MLLMs) in grounding complex referring expressions by introducing Chain-of-Thought Referring (CoTR). It formalizes a four-stage data curation pipeline to convert complex queries into ordered anchor-based reasoning sequences and pairs them with visual groundings, plus a Composite Referring Benchmark to stress compositionality. The authors present RefLM, a unified architecture that uses box-point prompts and the Segment Anything Model (SAM) for mask generation, guided by an adaptive weighted loss that emphasizes the final target grounding. Empirical results on RefCOCO-derived benchmarks and the proposed composite benchmark demonstrate consistent gains, validating the effectiveness of explicit, sequential reasoning in improving localization and segmentation under challenging referring expressions.
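To make the idea of an ordered anchor-based reasoning sequence concrete, here is a minimal sketch of how one such training sample might be represented. The class names, field names, example query, and box coordinates are all hypothetical illustrations and are not taken from the authors' data format; the only assumption drawn from the summary above is that anchors are grounded in order and the final step is the target.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class ReferringStep:
    phrase: str  # text span grounded at this step
    box: Box     # visual grounding paired with that span


@dataclass
class CoTReferringSample:
    query: str                  # the original complex referring expression
    steps: List[ReferringStep]  # ordered: anchors first, target last

    @property
    def anchors(self) -> List[ReferringStep]:
        # Every step before the last one is an intermediate anchor.
        return self.steps[:-1]

    @property
    def target(self) -> ReferringStep:
        # The last step grounds the object the query actually refers to.
        return self.steps[-1]


# Illustrative sample; the query and coordinates are made up.
sample = CoTReferringSample(
    query="the cup to the left of the laptop on the desk",
    steps=[
        ReferringStep("the desk", (0.0, 200.0, 640.0, 480.0)),
        ReferringStep("the laptop on the desk", (300.0, 150.0, 520.0, 330.0)),
        ReferringStep("the cup to the left of the laptop", (180.0, 240.0, 250.0, 320.0)),
    ],
)
```

Keeping the anchors and the target in a single ordered list mirrors the sequential nature of the reasoning: each later phrase can be resolved relative to regions already grounded.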
Abstract
Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, and they serve as benchmarks for the capabilities of Multimodal Large Language Models (MLLMs). To address the challenges these tasks pose, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data format. Our approach systematically parses textual structures into sequential referring steps; at each step, it identifies relationships and ensures consistent reference alignment, thereby improving accuracy on complex queries. We restructure the training data to enforce a new output form, provide new annotations for existing datasets, and compile an evaluation benchmark from existing resources that is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework and train it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and on RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable improvement of 2.5%+ over baseline models.
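The exact form of the adaptive weighted loss is not given in this section. As a rough illustration of the general idea of weighting the final target grounding more heavily than the intermediate steps, the sketch below uses a fixed, normalized weighting. The function names, the `final_weight` parameter, and the normalization are assumptions made for illustration; this is a simplified stand-in and not the authors' adaptive scheme.

```python
from typing import List, Sequence


def step_weights(num_steps: int, final_weight: float = 2.0) -> List[float]:
    """Return normalized per-step weights that favor the last (target) step.

    Assumption: intermediate steps get raw weight 1.0 and the final step
    gets `final_weight`; the weights are then normalized to sum to 1.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    raw = [1.0] * (num_steps - 1) + [final_weight]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_grounding_loss(
    step_losses: Sequence[float], final_weight: float = 2.0
) -> float:
    """Combine per-step grounding losses into one scalar.

    `step_losses` is ordered like the reasoning sequence: anchors first,
    target last. Each entry is assumed to be an already-computed loss
    for that step's grounding (for example, a box regression loss).
    """
    weights = step_weights(len(step_losses), final_weight)
    return sum(w * loss for w, loss in zip(weights, step_losses))
```

For three steps with `final_weight=2.0`, the weights come out as 0.25, 0.25, and 0.5, so an error on the target contributes twice as much as an error on either anchor. An adaptive variant would adjust these weights during training rather than fixing them up front.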
