You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo

TL;DR

HQNet is a unified, single-stage framework for multi-person, multi-task human-centric perception (HCP). It learns a shared Human Query that encodes instance-specific features at multiple granularities. A task-shared Transformer decoder refines a set of Human Queries, and lightweight task-specific heads produce the predictions; GaMS and HumanQuery-Instance (HQ-Ins) Matching further improve cross-task consistency and mesh quality. The approach is evaluated on COCO-UniHuman, a large-scale benchmark annotated for gender, age, and 3D mesh in multi-person scenes. Across detection, segmentation, pose, and attribute tasks, HQNet achieves state-of-the-art performance among multi-task HCP models and is competitive with task-specific models. It also adapts to new HCP tasks and domains, offering a scalable, efficient path toward unified human-centric perception in real-world scenarios.
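
The split between one shared per-person representation and several lightweight heads can be illustrated with a toy sketch. This is not the authors' implementation: the Transformer decoder is omitted entirely (the input vectors stand in for already-refined Human Queries), and the class name, head names, output sizes, and random linear weights are all assumptions chosen for illustration. The only borrowed fact is that COCO annotates 17 keypoints per person, hence the 34 keypoint outputs.

```python
import random


def make_weights(out_dim, in_dim, rng):
    """Random weight matrix with one row per output unit (illustrative only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(vec, weights):
    """Bias-free linear projection of one query vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class SharedQueryModel:
    """Toy stand-in: one embedding per person feeds every task head."""

    def __init__(self, query_dim, head_dims, seed=0):
        rng = random.Random(seed)
        # One small projection per task; all heads read the same query.
        self.heads = {
            name: make_weights(out_dim, query_dim, rng)
            for name, out_dim in head_dims.items()
        }

    def predict(self, queries):
        """Return one dict of per-task outputs for each person query."""
        return [
            {name: linear(query, weights) for name, weights in self.heads.items()}
            for query in queries
        ]


# Hypothetical head sizes: 4 box coordinates, 17 keypoints x (x, y),
# 2 gender logits, 1 age value.
HEAD_DIMS = {"box": 4, "keypoints": 34, "gender": 2, "age": 1}

model = SharedQueryModel(query_dim=8, head_dims=HEAD_DIMS)
rng = random.Random(1)
# Three random vectors standing in for three refined person queries.
queries = [[rng.random() for _ in range(8)] for _ in range(3)]
predictions = model.predict(queries)
```

Each entry of `predictions` corresponds to one person and holds every task's output, which is the sense in which a single query serves all tasks. In this sketch, supporting a further task only means adding another entry to `HEAD_DIMS`.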

Abstract

Human-centric perception (e.g., detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem in computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well studied individually, single-stage multi-task learning of HCP tasks has not been fully explored in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose the COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Code and data are available at https://github.com/lishuhuai527/COCO-UniHuman.

Paper Structure

This paper contains 45 sections, 2 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Multi-person human-centric perception tasks can be categorized into 4 groups: classification, detection, segmentation and pose estimation.
  • Figure 2: Overview of HQNet. HQNet unifies various representative HCP tasks in a single network by learning shared Human Query.
  • Figure 3: Computation cost analysis validates the efficiency of HQNet.
  • Figure 4: Effect of HumanQuery-Instance (HQ-Ins) Matching.
  • Figure A1: Statistics of the COCO-UniHuman benchmark. (a) The gender distribution of COCO-UniHuman is skewed toward male. (b) The age distribution spans ages 1 to 84 and is skewed toward young adults, since images are from public Internet repositories.
  • ...and 5 more figures