Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, Lei Zhang

TL;DR

Rex-Thinker reframes object referring as a grounded, step-by-step chain-of-thought (CoT) reasoning task over candidate object boxes, enabling verifiable predictions tied to visual evidence. A large, CoT-annotated dataset, HumanRef-CoT, supports two-stage training (SFT followed by GRPO) to improve accuracy and reduce hallucinations while preserving interpretability. The approach yields state-of-the-art results on in-domain HumanRef and strong zero-shot generalization to out-of-domain RefCOCOg, with GRPO providing additional gains and improved rejection behavior. Limitations include weaker handling of multi-object interactions, suggesting avenues for enhancing relational reasoning and consistency in future work.

Abstract

Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.
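
The candidate-then-verify flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` and `ReasoningStep` types, the `refer` function, and the `matches_expression` callback are all hypothetical names introduced here, and the callback is a plain function standing in for the model's per-candidate reasoning. The sketch only shows the control flow: evaluate every candidate box, record one reasoning step per box, and return an empty list (abstain) when nothing matches.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format for illustration: (x0, y0, x1, y1).
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    box: Box
    description: str  # hypothetical stand-in for the visual content of the box


@dataclass
class ReasoningStep:
    index: int      # 1-based index of the candidate (the "hint box")
    matches: bool   # whether this candidate satisfies the expression
    note: str       # human-readable justification tied to that box


def refer(
    candidates: List[Candidate],
    matches_expression: Callable[[Candidate], bool],
) -> Tuple[List[ReasoningStep], List[Box]]:
    """Evaluate each candidate in turn, then summarize.

    Returns the per-candidate reasoning trace and the selected boxes.
    An empty box list means the sketch abstains (no object matches).
    """
    steps: List[ReasoningStep] = []
    for i, cand in enumerate(candidates, start=1):
        ok = matches_expression(cand)  # placeholder for model reasoning
        verdict = "match" if ok else "no match"
        steps.append(ReasoningStep(i, ok, f"Candidate {i}: {cand.description} -> {verdict}"))

    # Summarization: keep only the boxes whose step was judged a match.
    selected = [candidates[s.index - 1].box for s in steps if s.matches]
    return steps, selected


if __name__ == "__main__":
    people = [
        Candidate((10, 20, 110, 300), "person in a red shirt"),
        Candidate((150, 30, 240, 310), "person in a blue shirt"),
    ]
    trace, boxes = refer(people, lambda c: "red" in c.description)
    for step in trace:
        print(step.note)
    print("selected:", boxes)
```

Because every selected box traces back to a recorded step, a wrong prediction can be inspected one candidate at a time, which is the verifiability property the abstract asks for.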

Paper Structure

This paper contains 45 sections, 5 equations, 47 figures, 10 tables.

Figures (47)

  • Figure 1: An example of Rex-Thinker for object referring with CoT reasoning of planning (task decomposition), action (evaluating each candidate), and summarization (final decision). Each step is grounded in a specific hint box (as denoted in the left image), enabling interpretable predictions.
  • Figure 2: Overview of the proposed CoT reasoning referring data engine. We prompt GPT-4o to generate a three-step CoT reasoning process, including planning, action, and summarization.
  • Figure 3: Overview of the Rex-Thinker architecture and our two-stage training method.
  • Figure 3: Out-of-domain evaluation results on RefCOCOg. $^*$Fine-tuned on RefCOCOg using GRPO.
  • Figure 4: An out-of-domain result. We use Rex-Thinker-GRPO trained on HumanRef-CoT to infer an unseen category (i.e., fish), demonstrating strong generalization. Boxes in the image denote hints.
  • ...and 42 more figures