Referring to Any Person

Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang

TL;DR

Referring to Any Person formalizes a multi-instance referring task and introduces a large, realistic HumanRef dataset to drive progress beyond one-to-one benchmarks. It then proposes RexSeek, a retrieval-based, detection-oriented multimodal LLM that fuses a strong person detector (DINO-X) with Qwen2.5 to detect all matching individuals and reason over complex descriptions. Through a four-stage training regime, RexSeek achieves robust perception and language understanding, demonstrating superior performance on HumanRef and generalization to generic object referring. The work highlights the critical role of data design and staged training in enabling reliable, real-world referring systems with potential impact across human-robot interaction, surveillance analysis, and multimedia retrieval.

Abstract

Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek
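
To make the multi-instance point concrete, the following is a minimal sketch of how a one-box answer can be scored against a description that refers to several people. It is an illustration only: the IoU threshold of 0.5, the greedy matching, and the names `iou` and `precision_recall` are assumptions chosen for this example, not the evaluation protocol of the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred: List[Box], gt: List[Box], thr: float = 0.5):
    """Greedily match each prediction to its best unmatched ground-truth box.

    The threshold and greedy strategy are assumptions made for illustration.
    """
    matched = set()
    true_pos = 0
    for p in pred:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gt):
            if j in matched:
                continue
            overlap = iou(p, g)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    return precision, recall


# Three people match the description, but the model returns only one box.
gt_boxes = [(0, 0, 10, 20), (20, 0, 30, 20), (40, 0, 50, 20)]
single_box = [(0, 0, 10, 20)]
print(precision_recall(single_box, gt_boxes))  # high precision, low recall
print(precision_recall(gt_boxes, gt_boxes))    # all three people found
```

Under these assumptions, a model that outputs a single correct box keeps perfect precision while recovering only one third of the referred individuals, which is the kind of failure a one-to-one benchmark cannot expose.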

Paper Structure

This paper contains 24 sections, 6 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: We introduce referring to any person, a task that requires detecting all individuals in an image who match a given natural language description, and a new model, RexSeek, designed for this task with strong perception and understanding capabilities that effectively capture attributes, spatial relations, interactions, reasoning, celebrity recognition, etc.
  • Figure 2: Visualization results of Qwen2.5-VL [bai2025qwen2], InternVL-2.5 [chen2024expanding], and DeepSeek-VL2 [wu2024deepseek] on the human referring task. Despite achieving strong results on the referring benchmarks RefCOCO/+/g [Datasets:REFCOCOG, Datasets:REFCOCO], state-of-the-art models struggle when tasked with identifying multiple individuals, as they output an insufficient number of bounding boxes.
  • Figure 3: Overview of the manual annotation pipeline of the HumanRef dataset.
  • Figure 4: Visualization of the six subsets in the HumanRef Benchmark.
  • Figure 5: Distribution of the number of individuals per image and the number of individuals referenced by each referring expression.
  • ...and 5 more figures