RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

Jie Huang, Ruibing Hou, Jiahe Zhao, Hong Chang, Shiguang Shan

TL;DR

This work introduces RefHCM, a unified framework for referring human perceptions that integrates four tasks (localization, pose, parsing, and captioning) into a single sequence-to-sequence model. Sequence mergers convert heterogeneous inputs into a common token sequence, a universal encoder-decoder backbone generates the output tokens, and dispensers map those tokens back to task-specific outputs. Two specialized mechanisms, Location-Context Restriction (LCR) and Query Parallel Generation (QPG), target accuracy and inference efficiency. The authors also present the ReasonRef benchmark, which evaluates reasoning-based referring across five dimensions and is used to demonstrate RefHCM's zero-shot reasoning capabilities and cross-task transfer. Across REC, RKpt, RPar, and RHrc, RefHCM attains competitive or superior results, and the setup targets interactive human-centric AI applications such as chatbots and sports analytics.

Abstract

Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM's competitive and even superior performance across multiple human-centric referring tasks. The code and data are publicly available at https://github.com/JJJYmmm/RefHCM.

Paper Structure

This paper contains 18 sections, 14 equations, 7 figures, and 13 tables.

Figures (7)

  • Figure 1: Overview of the RefHCM model. RefHCM can handle four referring human-centric tasks in a unified way. Taking the referring keypoint task as an example, RefHCM first tokenizes the input image using the image merger $\mathcal{M}_I$ and the corresponding task instruction using the text merger $\mathcal{M}_T$. The resulting token sequence is then passed through the encoder-decoder architecture to generate the desired output token sequence. Finally, the keypoint dispenser $\mathcal{P}_K$ transforms the output tokens into human keypoints.
  • Figure 2: Illustration of Location-Context Restriction (LCR), which prepends the human bounding box to the keypoint output of the RKpt task. Because decoding is autoregressive, the bounding box and the keypoints are generated sequentially, allowing for mutual positive constraints between them.
  • Figure 3: Overview of QPG (Query Parallel Generation), which significantly boosts inference speed. Note that the queries can see each other, akin to full mask attention. During inference, generation shifts from autoregressive to parallel upon encountering the parsing map query token $<\text{BOP}>$. $M$ is the size of the codebook in the parsing VQ-VAE, which also corresponds to the prediction range of the decoder.
  • Figure 4: ReasonRef benchmark construction pipeline. We use GPT-4 to generate descriptions across five dimensions covering identity, pose/clothing, relations, and future prediction, then manually verify the generated descriptions.
  • Figure 5: Qualitative results on the RKpt task. RefHCM produces more precise predictions for occluded keypoints.
  • ...and 2 more figures
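
The Location-Context Restriction idea in the Figure 2 caption (bounding box first, keypoints after, all in one output sequence) can be illustrated with a small sketch. This is not the authors' implementation: the bin count `NUM_BINS`, the `(x1, y1, x2, y2)` box layout, and the helper names `quantize`, `dequantize`, `build_lcr_target`, and `split_lcr_output` are all assumptions made for illustration. The sketch only shows how continuous coordinates might be turned into discrete tokens, and how a dispenser-like step could split a generated sequence back into a box and keypoints.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 1000  # assumed number of coordinate bins; not taken from the paper


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image extent
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(bin_id: int, size: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the pixel coordinate at the centre of that bin."""
    return (bin_id + 0.5) / num_bins * size


def build_lcr_target(
    box: Tuple[float, float, float, float],
    keypoints: Sequence[Tuple[float, float]],
    width: float,
    height: float,
) -> List[int]:
    """Build a target sequence with the box bins first, then the keypoint bins."""
    x1, y1, x2, y2 = box
    tokens = [
        quantize(x1, width), quantize(y1, height),
        quantize(x2, width), quantize(y2, height),
    ]
    for x, y in keypoints:
        tokens += [quantize(x, width), quantize(y, height)]
    return tokens


def split_lcr_output(
    tokens: Sequence[int], width: float, height: float
) -> Tuple[Tuple[float, ...], List[Tuple[float, float]]]:
    """Hypothetical dispenser step: the first four bins are the box, the rest keypoints."""
    box_bins, kpt_bins = tokens[:4], tokens[4:]
    box = tuple(
        dequantize(b, width if i % 2 == 0 else height)
        for i, b in enumerate(box_bins)
    )
    keypoints = [
        (dequantize(kpt_bins[i], width), dequantize(kpt_bins[i + 1], height))
        for i in range(0, len(kpt_bins), 2)
    ]
    return box, keypoints


if __name__ == "__main__":
    target = build_lcr_target(
        box=(100, 50, 300, 400),
        keypoints=[(200, 120), (210, 300)],
        width=640,
        height=480,
    )
    print(target)
    print(split_lcr_output(target, width=640, height=480))
```

Because the box tokens come first in the sequence, an autoregressive decoder conditions every keypoint token on the already generated box, which is the ordering the caption describes. The round trip through `dequantize` is lossy by at most one bin width per coordinate.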