UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao

TL;DR

UniParser tackles multi-human parsing by unifying instance-level and category-level representations within a cosine-space correlation framework. It introduces a Center Locator, an Instance Feature Space Builder, and a Category Feature Space Builder to learn discriminative instance and category features, then fuses them in an end-to-end, NMS-free pipeline that outputs pixel-level parsing. The approach achieves state-of-the-art results on MHPv2.0 and CIHP (e.g., improvements in AP$^{p}_{50}$, AP$^{p}_{vol}$, and PCP$_{50}$) while reducing inference time and parameter count. The work demonstrates the effectiveness of joint optimization in a unified representation space and underscores the potential of applying correlation learning to other fine-grained, multi-instance vision tasks.

Abstract

Multi-human parsing is an image segmentation task that requires both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the output form of each module as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
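
To make the idea of correlation in cosine space concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each instance or category is represented by a single kernel vector and that every pixel is labeled with the kernel it is most similar to; the function names `cosine_similarity_map` and `assign_pixels`, the toy feature map, and the choice of sampling kernels at one pixel per region are all hypothetical and serve only as illustration.

```python
import numpy as np


def cosine_similarity_map(features: np.ndarray, kernel: np.ndarray,
                          eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between one kernel vector and every pixel feature.

    features: (C, H, W) feature map; kernel: (C,) vector.
    Returns an (H, W) map with values in [-1, 1].
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)  # (C, H*W)
    # L2-normalize each pixel feature and the kernel, so the dot product
    # depends only on the angle between them, not on their magnitudes.
    flat_norm = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    kernel_norm = kernel / (np.linalg.norm(kernel) + eps)
    return (kernel_norm @ flat_norm).reshape(h, w)


def assign_pixels(features: np.ndarray, kernels) -> np.ndarray:
    """Label each pixel with the index of its most similar kernel."""
    sims = np.stack([cosine_similarity_map(features, k) for k in kernels])
    return sims.argmax(axis=0)  # (H, W) map of kernel indices


# Toy example: a 2-channel, 2x4 feature map whose left half points along
# channel 0 and whose right half points along channel 1 (different norms).
features = np.zeros((2, 2, 4))
features[0, :, :2] = 3.0
features[1, :, 2:] = 0.5

# Hypothetical kernels: the features sampled at one pixel in each region.
kernels = [features[:, 0, 0], features[:, 0, 3]]
labels = assign_pixels(features, kernels)
print(labels)
```

Because both sides are normalized, the two regions are separated cleanly even though their feature magnitudes differ by a factor of six; this is the property that makes a shared cosine space convenient for comparing instance and category features on equal footing.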

Paper Structure (23 sections, 7 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Illustration of state-of-the-art network architectures for multi-human parsing. (a) DSPF (Zhou et al., 2021); (b) AIParsing (Zhang et al., 2022); (c) SMP (2023); (d) UniParser (Ours). Yellow: two-stage; Green: single-stage.
  • Figure 2: Overview of UniParser. ResNet and FPN are utilized to obtain multi-scale features sent to the following modules. The Center Locator is responsible for localizing human barycenters. Instance Feature Space Builder and Category Feature Space Builder learn instance and category correlation representations in cosine space, respectively. The fusion module combines instance and category features to predict instance-aware human parsing.
  • Figure 3: Cosine space learned for instance and category features. $\theta_{inter}$ denotes the cosine distance between different instances or categories. $\theta_{intra}$ denotes the cosine distance between pixel features within the same instance or category.
  • Figure 4: Intermediate results and visualization comparisons in UniParser. (a) shows the output of the Center Locator (CL); (b) and (c) show the cosine similarity maps corresponding to the two centers; (d) and (e) are the similarity maps of the existing category "T-shirt" and the non-existing category "Polo shirt"; (f) and (g) show the similarity matrices of category kernels with and without metric loss; (h) and (i) give visual comparisons between the state-of-the-art method SMP (2023) and UniParser: input image (left), SMP (middle), UniParser (right). Best viewed zoomed in.
  • Figure 5: Structures of different fusion modules.
  • ...and 3 more figures