Multiple-Human Parsing in the Wild

Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, Jiashi Feng

TL;DR

The paper introduces the Multi-Human Parsing in the Wild (MHP) problem and provides a new large-scale dataset with pixel-level, instance-aware annotations for multiple people in realistic scenes. It proposes MH-Parser, a bottom-up model that jointly predicts a global instance-agnostic parsing map and a learned pairwise affinity graph, refined by a Graph-GAN and CRF-based post-processing to produce accurate multi-person parsing. Key innovations include Graph-GAN for learning high-order affinities on graphs of superpixels and a CRF refinement that integrates learned affinities with appearance and spatial cues. The method achieves results competitive with state-of-the-art baselines and handles closely entangled humans well, establishing a solid baseline for future research on multi-human parsing in the wild.

Abstract

Human parsing is attracting increasing research attention. In this work, we aim to push the frontier of human parsing by introducing the problem of multi-human parsing in the wild. Existing works on human parsing mainly tackle single-person scenarios, which deviates from real-world applications where multiple persons are present simultaneously with interaction and occlusion. To address the multi-human parsing problem, we introduce a new multi-human parsing (MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP dataset contains multiple persons captured in real-world scenes with pixel-level fine-grained semantic annotations in an instance-aware setting. The MH-Parser generates global parsing maps and person instance masks simultaneously in a bottom-up fashion with the help of a new Graph-GAN model. We envision that the MHP dataset will serve as a valuable data resource to develop new multi-human parsing models, and the MH-Parser offers a strong baseline to drive future research for multi-human parsing in the wild.

Paper Structure

This paper contains 27 sections, 18 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Annotation examples for our constructed Multiple Human Parsing (MHP) dataset (c) and other existing datasets for human parsing (a: ATR liang2015deep; b: Look into Person (LIP) gong2017look). In (c), rectangles in different colors indicate distinct person instances. ATR contains images of single persons in upright positions; LIP includes more pose variations, but still contains only a single person in each image. The MHP dataset provides images with fine-grained annotations for multiple persons with interaction, occlusion and various poses, aligning better with real-world scenarios.
  • Figure 2: Examples and statistics of the MHP dataset. Left: An annotated example for multi-human parsing. Middle: Statistics on number of persons in one image. Right: The data distribution on 18 semantic part labels in the MHP dataset.
  • Figure 3: Architecture overview of the proposed Multiple Human Parser (MH-Parser). Here $\mathbf{M}$ refers to the global accordance map, $\mathbf{A}$ refers to the ground truth pairwise affinity map and $\bar{\mathbf{A}}$ denotes the predictions. $\mathbf{A}$ is obtained by rule-based mapping from $\mathbf{M}$ and the corresponding superpixel map (see Eqn. \ref{eqn:gt_pam} and \ref{eqn:majority_vote}), and $\bar{\mathbf{A}}$ is the output of the graph generator (consisting of the representation learner and the affinity prediction net). The graph convolution discriminator takes the affinity graph from the graph generator as input and predicts whether it is a ground truth or a prediction. Fusing the predicted instance-agnostic parsing map and instance masks (constructed from $\bar{\mathbf{A}}$) gives the instance-aware parsing results.
  • Figure 4: Visualization of parsing results. For each (a) input image, we show the (b) parsing ground truth, (c) global parsing prediction, and person instance map predictions from (d) Mask RCNN, (e) DL and (f) MH-Parser. In (b) and (c), each color represents a semantic parsing category. In (d), (e) and (f), each color represents one person instance. The proposed MH-Parser generates satisfactory global parsing and outperforms Mask RCNN and DL when persons are closely entangled.
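
The Figure 3 caption states that the ground truth pairwise affinity map $\mathbf{A}$ is derived by a rule-based mapping from $\mathbf{M}$ and a superpixel map, and it points to a majority-vote equation that is not reproduced in this summary. The following NumPy sketch shows one plausible form of such a mapping, not the authors' exact rule. It assumes that a per-pixel person instance map is available, that each superpixel takes the most frequent instance id among its pixels, and that two superpixels have affinity 1 only when they share the same non-background instance. The function names, the background id of 0, and the toy arrays are all hypothetical.

```python
import numpy as np


def superpixel_instance_labels(instance_map, superpixel_map, num_superpixels):
    """Give each superpixel the most frequent instance id among its pixels.

    Assumption: instance ids are non-negative integers and 0 is background.
    Ties are broken in favor of the smallest id (behavior of argmax).
    """
    labels = np.zeros(num_superpixels, dtype=np.int64)
    for s in range(num_superpixels):
        ids = instance_map[superpixel_map == s]
        if ids.size > 0:
            labels[s] = np.bincount(ids).argmax()
    return labels


def pairwise_affinity(labels, background=0):
    """Return a symmetric 0/1 matrix over superpixels.

    Assumption: A[i, j] = 1 only if superpixels i and j carry the same
    instance id and that id is not the background id.
    """
    same_instance = labels[:, None] == labels[None, :]
    foreground = labels != background
    both_foreground = foreground[:, None] & foreground[None, :]
    return (same_instance & both_foreground).astype(np.uint8)


# Toy 2x4 image: two persons (ids 1 and 2), one background pixel (id 0).
instance_map = np.array([[1, 1, 1, 2],
                         [1, 0, 1, 2]])
# Three hypothetical superpixels (ids 0, 1, 2).
superpixel_map = np.array([[0, 0, 1, 2],
                           [0, 0, 1, 2]])

labels = superpixel_instance_labels(instance_map, superpixel_map, 3)
A = pairwise_affinity(labels)
print(labels)
print(A)
```

In this toy case superpixels 0 and 1 both fall on person 1, so they are linked in `A`, while superpixel 2 belongs to person 2 and is linked only to itself. Grouping superpixels that are connected in a predicted affinity matrix would then yield the person instance masks that the caption describes fusing with the instance-agnostic parsing map.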