Deep Learning Technique for Human Parsing: A Survey and Outlook

Lu Yang, Wenhe Jia, Shan Li, Qing Song

TL;DR

This survey responds to the rapid growth of human parsing research by organizing deep learning methods across single human parsing (SHP), multiple human parsing (MHP), and video human parsing (VHP), and by detailing datasets and benchmark results. It introduces a transformer-based baseline, M2FP, that models background, part, and human queries within the Mask2Former framework to unify the parsing tasks. The article analyzes open issues, compares performance across datasets, and discusses future directions, including the implications of foundation models for human-centric visual understanding. A regularly updated project page tracks ongoing developments in this fast-evolving field.

Abstract

Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. Over the last decade, it has attracted growing interest in the computer vision community and has been used in a broad range of practical applications, including security monitoring, social media, and visual special effects. Although deep learning-based human parsing solutions have made remarkable progress, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing. For each, we introduce the task setting, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework that provides a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.

Paper Structure (42 sections, 7 figures, 15 tables)

Figures (7)

  • Figure 1: Human parsing tasks reviewed in this survey: (a) single human parsing (SHP) cvpr2021l2id; (b) multiple human parsing (MHP) gong2018instance; (c) video human parsing (VHP) zhou2018adaptive.
  • Figure 2: Outline of this survey.
  • Figure 3: Timeline of representative human parsing works from 2012 to 2023. The upper part represents the datasets of human parsing (§\ref{sec:hp-data}), and the lower part represents the models of human parsing (§\ref{sec:dl-base-hp}).
  • Figure 4: Correlations of different SHP, MHP and VHP methods (§\ref{sec:hp-summary}). Connections between the arc edges summarize the correlation between human parsing methods; each connecting line stands for a study that uses both methods. The longer the arc, the more methods of that kind, and the wider the connecting line, the more studies that combine the two methods. This correlation summary reveals the prevalence of various human parsing methods.
  • Figure 5: Correlations of different SHP, MHP and VHP studies (§\ref{sec:hp-summary}). All the involved human parsing studies are shown as dots, with connecting lines representing their citing relations. A citing relation here means a citation that appears in experimental comparisons, which excludes low-correlation citations from background introductions. Since each line represents a citation between two studies, the larger the dot, the more often that study is cited. These correlations highlight the relatively prominent studies.
  • ...and 2 more figures