DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Yonghao Dang, Jianqin Yin, Liyuan Liu, Pengxiang Ding, Yuan Sun, Yanzhu Hu

TL;DR

DHRNet (Dual-path Hierarchical Relation Network) tackles multi-person pose estimation (MPPE) by jointly modeling cross-instance and cross-joint interactions. Its core is the Dual-path Interaction Modeling Module (DIM), which combines cross-instance (CIM) and cross-joint (CJM) interaction blocks with adaptive feature fusion modules (ADFMs) to produce instance- and joint-aware representations that feed a pose decoder. On COCO, CrowdPose, and OCHuman, DHRNet reports state-of-the-art performance among single-stage methods, with notable gains in occluded and crowded scenes, and ablations support the contribution of each DIM component. The result is a scalable, end-to-end framework that uses complementary relational information to improve joint localization in challenging MPPE scenarios.

Abstract

Multi-person pose estimation (MPPE) presents a formidable yet crucial challenge in computer vision. Most existing methods model interactions either between instances or between joints in isolation, which is inadequate for scenarios that demand concurrent localization of both instances and joints. This paper introduces a novel CNN-based single-stage method, named Dual-path Hierarchical Relation Network (DHRNet), to extract instance-to-joint and joint-to-instance interactions concurrently. Specifically, we design a dual-path interaction modeling module (DIM) that arranges cross-instance and cross-joint interaction modeling modules in two complementary orders, enriching interaction information by integrating the merits of the different correlation modeling branches. Notably, DHRNet excels in joint localization by leveraging information from other instances and joints. Extensive evaluations on challenging datasets, including COCO, CrowdPose, and OCHuman, showcase DHRNet's state-of-the-art performance. The code will be released at https://github.com/YHDang/dhrnet-multi-pose-estimation.

Paper Structure (30 sections, 10 equations, 9 figures, 5 tables)

This paper contains 30 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Interactive information in a multi-person scenario. The solid double arrows represent cross-instance interaction. The dashed double arrows denote cross-joint interaction. Most existing methods model the interactive information in a single, one-way order, which ignores the complementarity between cross-instance and cross-joint interactions. Our method makes full use of the complementarity of these two interactions through bidirectional correlation modeling.
  • Figure 2: Overview of the proposed DHRNet. For a given input image, a feature encoder extracts visual features $F$. Then, an instance decoder and a keypoints decoder generate instance masks $F_{mask}$ and joint representations $F_{joint}$. The instance masks are fused with the visual features to obtain instance representations $F_{inst}$. DIM takes $F_{inst}$ and $F_{joint}$ as input to model the correlations. Finally, a pose decoder aggregates the relation-based features extracted by DIM to estimate poses.
  • Figure 3: Structure of the proposed cross-instance interaction modeling module.
  • Figure 4: Structure of the proposed cross-joint interaction modeling module.
  • Figure 5: Visualization of center maps and correlations on the COCO val set. (a) shows the output of CID. (b) shows the output of our DHRNet. (c) and (d) show cross-instance correlations in the IJR and JIR branches. In panel (d), the x-axis and y-axis index each person's proposal.
  • ...and 4 more figures
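
The idea of applying cross-instance and cross-joint interaction modeling in two complementary orders can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names (`attend`, `cross_instance`, `cross_joint`, `dual_path`) are hypothetical. Features are simplified to a single nested list indexed as `[instance][joint][channel]`, whereas the paper's DIM takes $F_{inst}$ and $F_{joint}$ as separate inputs. A dot-product softmax mixing step stands in for the CIM and CJM blocks, and a plain average stands in for the adaptive fusion. The mapping of the branch names IJR and JIR to the two orders is also an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors):
    """Toy relation step: each vector becomes a similarity-weighted mix of all vectors."""
    dim = len(vectors[0])
    mixed = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return mixed


def cross_instance(feats):
    """Hypothetical stand-in for CIM: relate the same joint across all instances."""
    n_inst, n_joint = len(feats), len(feats[0])
    out = [[None] * n_joint for _ in range(n_inst)]
    for j in range(n_joint):
        column = attend([feats[i][j] for i in range(n_inst)])
        for i in range(n_inst):
            out[i][j] = column[i]
    return out


def cross_joint(feats):
    """Hypothetical stand-in for CJM: relate the joints within each instance."""
    return [attend(joints) for joints in feats]


def dual_path(feats):
    """Run both orders and fuse them (plain average as a placeholder for adaptive fusion)."""
    ijr = cross_joint(cross_instance(feats))  # assumed order: instances first, then joints
    jir = cross_instance(cross_joint(feats))  # assumed order: joints first, then instances
    return [
        [[(a + b) / 2 for a, b in zip(va, vb)] for va, vb in zip(row_a, row_b)]
        for row_a, row_b in zip(ijr, jir)
    ]


# Two instances, three joints each, two channels per joint (made-up numbers).
feats = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
]
fused = dual_path(feats)
print(len(fused), len(fused[0]), len(fused[0][0]))  # the input shape is preserved
```

The two branches generally produce different features because the mixing steps do not commute, which is one way to read the complementarity that the Figure 1 caption describes. In a real model the fusion weights and the relation blocks would be learned rather than fixed as they are here.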