BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation

Uyoung Jeong, Seungryul Baek, Hyung Jin Chang, Kwang In Kim

TL;DR

This work addresses the challenge of disentangling per-person features and associating keypoints with individual persons in crowded multi-person pose estimation (MPPE). It introduces BoIR, a box-supervised instance representation framework that uses a Bbox Mask Loss to provide dense, bounding-box-level supervision and couples it with auxiliary multi-task heads (embedding, bottom-up keypoint, and bbox) that are used only during training, so inference cost does not increase. The approach reports gains over the prior state of the art on COCO val (+0.8 AP), COCO test-dev (+0.5 AP), CrowdPose (+4.9 AP), and OCHuman (+3.5 AP), with the largest improvements on the crowded and occluded benchmarks. The results suggest that bounding-box supervision can effectively complement keypoint supervision to produce robust, disentangled instance representations, with potential extensions to additional auxiliary tasks and multi-modal signals. Overall, BoIR offers a practical way to improve single-stage MPPE in crowded real-world scenes.

Abstract

Single-stage multi-person human pose estimation (MPPE) methods have shown great performance improvements, but existing methods fail to disentangle features by individual instances in crowded scenes. In this paper, we propose a bounding box-level instance representation learning method called BoIR, which simultaneously solves instance detection, instance disentanglement, and instance-keypoint association problems. Our new instance embedding loss provides a learning signal over the entire area of the image using bounding box annotations, achieving a globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming the state of the art on COCO val (+0.8 AP), COCO test-dev (+0.5 AP), CrowdPose (+4.9 AP), and OCHuman (+3.5 AP). Code will be available at https://github.com/uyoung-jeong/BoIR
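
The claim that multi-task learning adds no inference cost can be illustrated with a minimal sketch in which auxiliary heads are evaluated only in training mode. Everything here is an assumption made for illustration: the class `MultiHeadSketch`, the channel counts, and the use of plain 1x1 projections for every head are hypothetical and do not reproduce the authors' architecture (in particular, the paper's instance keypoint head is more elaborate). Only the head names (center, kpt, buk, bbox, emb) and the primary/auxiliary split come from the paper's framework figure.

```python
import numpy as np

# Head names follow the paper's framework figure; the split is the point here.
PRIMARY_HEADS = ("center", "kpt")          # used in training and inference
AUXILIARY_HEADS = ("buk", "bbox", "emb")   # used in training only


class MultiHeadSketch:
    """Toy model: every head is a 1x1 projection of a shared feature map.

    Channel counts are illustrative assumptions, not values from the paper.
    """

    def __init__(self, in_channels=8, num_keypoints=17, emb_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        out_channels = {
            "center": 1,
            "kpt": num_keypoints,
            "buk": num_keypoints,
            "bbox": 4,
            "emb": emb_dim,
        }
        self.weights = {
            name: 0.1 * rng.standard_normal((channels, in_channels))
            for name, channels in out_channels.items()
        }

    def forward(self, feat, training):
        """feat: (C, H, W) backbone feature. Returns a dict of head outputs."""
        names = PRIMARY_HEADS + (AUXILIARY_HEADS if training else ())
        # A 1x1 convolution is a per-pixel matrix product over channels.
        return {
            name: np.einsum("oc,chw->ohw", self.weights[name], feat)
            for name in names
        }
```

In this sketch the auxiliary heads shape the shared feature only through their training losses. Calling `forward(feat, training=False)` never evaluates them, so the inference path is the same as that of a model without auxiliary heads.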

Paper Structure (17 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: (a) Bbox Mask Loss framework. The blue dot is the center of the query box (drawn in blue), while the red dot is the center of another box. $\mathcal{L}_{pull}^{in}$ pulls together the instance-center embedding and the soft-masked mean embedding inside the box, $\mathcal{L}_{push}^{out}$ pushes apart pairwise instance-background embeddings, and $\mathcal{L}_{push}^{inst}$ pushes apart pairwise instance embeddings. (b)-(f) Visualization of feature similarities computed from the center features of the bounding boxes shown in (b). (c) and (d) are CID feature similarities from the A and B centers, respectively, while (e) and (f) are the corresponding BoIR feature similarities.
  • Figure 2: Left: Overview of our framework. The instance keypoint (kpt) head and the center head are the primary regression heads. The bottom-up keypoint (buk) head, bounding box (bbox) head, and embedding (emb) head are auxiliary task regressors that are not used during inference. Right: Layer composition of the instance keypoint head. 'Linear': linear layer, 'Conv': convolution layer, 'LN': Layer Normalization, 'IN': Instance Normalization, '$\otimes$': Hadamard product, '$coord$': relative coordinates of the heatmap pixel indices. $f'\in\mathbb{R}^{C'\times H\times W}$: projection of $f$ by a single convolution layer.
  • Figure 3: Example outcomes using our approach. The image on the left is from the COCO val set, while the image on the right is from the CrowdPose test set. We ran t-SNE for 250 iterations on the output backbone feature with three output dimensions per pixel, which are mapped directly to normalized RGB values.
  • Figure 4: Comparative visualization on the COCO val set.
  • Figure 5: Comparative visualization on the CrowdPose test set.
  • ...and 1 more figure
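
The three terms named in the Figure 1 caption can be illustrated with a toy computation. This is a hedged sketch, not the paper's formulation: the function name `bbox_mask_loss_sketch`, the use of cosine similarity, the hinge at zero, the equal treatment of all pairs, and the hard (binary) box mask in place of the paper's soft mask are all assumptions made to keep the example short. It only shows the direction in which each term acts: the pull term draws a box-center embedding towards the mean embedding inside its box, and the two push terms separate instance embeddings from the background and from each other.

```python
import numpy as np


def unit(v, eps=1e-8):
    """L2-normalize along the channel (first) axis."""
    return v / (np.linalg.norm(v, axis=0, keepdims=True) + eps)


def bbox_mask_loss_sketch(emb, boxes):
    """Toy pull/push terms for one image (illustrative assumptions only).

    emb:   (C, H, W) per-pixel embedding map.
    boxes: list of (x0, y0, x1, y1) integer pixel boxes, x1/y1 exclusive.
    """
    emb = unit(emb)
    _, height, width = emb.shape
    foreground = np.zeros((height, width), dtype=bool)
    centers, means = [], []
    for x0, y0, x1, y1 in boxes:
        mask = np.zeros((height, width), dtype=bool)
        mask[y0:y1, x0:x1] = True  # hard mask here; the paper uses a soft mask
        foreground |= mask
        centers.append(emb[:, (y0 + y1) // 2, (x0 + x1) // 2])
        means.append(unit(emb[:, mask].mean(axis=1)))
    centers, means = np.stack(centers), np.stack(means)  # each (N, C)

    # Pull: box-center embedding towards the mean embedding inside its box.
    pull_in = float(np.mean(1.0 - np.sum(centers * means, axis=1)))

    # Push (out): instance means away from the mean background embedding.
    push_out = 0.0
    if (~foreground).any():
        background = unit(emb[:, ~foreground].mean(axis=1))
        push_out = float(np.mean(np.maximum(0.0, means @ background)))

    # Push (inst): every pair of distinct instance means away from each other.
    push_inst = 0.0
    if len(boxes) > 1:
        similarity = means @ means.T
        off_diagonal = ~np.eye(len(boxes), dtype=bool)
        push_inst = float(np.mean(np.maximum(0.0, similarity[off_diagonal])))

    return {"pull_in": pull_in, "push_out": push_out, "push_inst": push_inst}
```

Under these assumptions, an embedding map in which each box and the background occupy mutually orthogonal directions drives all three terms to roughly zero. A map that assigns the same embedding to every pixel keeps the pull term near zero but maximizes both push terms. This mirrors the entangled versus disentangled similarity maps contrasted in panels (c)-(f).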