PANDA: Pose Aligned Networks for Deep Attribute Modeling

Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, Lubomir Bourdev

TL;DR

The paper addresses the prediction of human attributes under large variation in pose, viewpoint, and occlusion. It introduces PANDA, a hybrid architecture that trains CNNs on semantically aligned body-part patches (poselets) to produce pose-normalized features; these features are combined with those of a whole-image CNN, and a linear classifier is trained for each attribute. On the Berkeley Attributes of People dataset and the Attributes25K dataset, PANDA achieves state-of-the-art performance, outperforming both traditional part-based methods and generic CNN baselines, and it also performs strongly on the LFW gender task. The approach shows that combining mid-level part localization with deep learning handles pose variation while reducing the amount of training data required, with potential extensions to related tasks such as detection and pose estimation.

Abstract

We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNNs) have been shown to perform very well on large-scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets, which are not presently available. Part-based models, such as poselets and DPM, have been shown to perform well for this problem, but they are limited by shallow low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.

Paper Structure

This paper contains 17 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of Pose Aligned Networks for Deep Attribute modeling (PANDA). One convolutional neural net is trained on semantic part patches for each poselet, and the top-level activations of all nets are then concatenated to obtain a pose-normalized deep representation. The final attributes are predicted by a linear SVM classifier using the pose-normalized representations.
  • Figure 2: Part-based Convolutional Neural Nets. For each poselet, one convolutional neural net is trained on patches resized to 64x64. The network consists of 4 stages of convolution/pooling/normalization followed by a fully connected layer. It then branches into one fully connected layer with 128 hidden units for each attribute. We concatenate the fc_attr activations from each poselet network to obtain the pose-normalized representation. The filter sizes and numbers of filters are shown in the figure.
  • Figure 3: Poselet Input Patches from the Berkeley Attributes of People Dataset. For each poselet, we use the detected patches to train a convolutional neural net. Shown are example input patches with high scores for poselets 1, 16, and 79.
  • Figure 4: Statistics of the number of ground-truth labels on the Attributes25K dataset. For each attribute, green is the number of positive labels, red is the number of negative labels, and yellow is the number of uncertain labels.
  • Figure 5: Example of failure cases on the Berkeley Attributes of People test dataset.
  • ...and 2 more figures
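
The concatenation step described in the captions of Figures 1 and 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the dictionary input format, and the toy dimensions are hypothetical, and filling the slots of undetected poselets with zeros is an assumption of this sketch rather than something stated in the text above. The per-poselet CNNs themselves are not reproduced, since the filter sizes appear only in the figure.

```python
import numpy as np


def pose_normalized_representation(poselet_features, num_poselets, feature_dim):
    """Concatenate per-poselet activations into one fixed-length vector.

    poselet_features: dict mapping poselet index -> 1-D activation vector
        (standing in for the fc_attr activations of that poselet's network).
    Poselets absent from the dict are treated as not detected and their
    slot is left at zero (an assumption of this sketch).
    """
    rep = np.zeros(num_poselets * feature_dim, dtype=np.float32)
    for idx, feat in poselet_features.items():
        if not 0 <= idx < num_poselets:
            raise ValueError(f"poselet index {idx} out of range")
        feat = np.asarray(feat, dtype=np.float32)
        if feat.shape != (feature_dim,):
            raise ValueError(f"expected shape ({feature_dim},), got {feat.shape}")
        # Each poselet owns a fixed slice, so the layout is the same for every image.
        rep[idx * feature_dim:(idx + 1) * feature_dim] = feat
    return rep


def attribute_score(rep, weights, bias):
    """Linear decision value w.x + b for one attribute (hypothetical weights)."""
    return float(rep @ weights + bias)


# Toy usage: 3 poselets with 4-dimensional features; poselet 1 is not detected.
rep = pose_normalized_representation(
    {0: [1, 2, 3, 4], 2: [5, 6, 7, 8]}, num_poselets=3, feature_dim=4
)
score = attribute_score(rep, np.ones(12), -1.0)
```

Because every poselet writes to a fixed slice, a linear classifier trained on this vector sees the same body part at the same position for every image, which is the sense in which the representation is pose-normalized.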