Part-based R-CNNs for Fine-grained Category Detection

Ning Zhang, Jeff Donahue, Ross Girshick, Trevor Darrell

TL;DR

The paper tackles fine-grained categorization by localizing object parts and modeling their geometry to normalize pose. It introduces Part-based R-CNNs, which learn object and semantic part detectors on bottom-up region proposals and enforce geometric constraints to produce a pose-normalized representation for classification. The approach achieves state-of-the-art results on Caltech-UCSD birds (CUB-200-2011) even without test-time bounding boxes, aided by CNN feature extraction and targeted fine-tuning. Ablation and analysis demonstrate the benefits of non-parametric geometric priors and reveal sensitivities to hyperparameters, pointing to future work in joint part-category learning and weakly supervised part discovery.

Abstract

Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our part localization. Starting from bottom-up region proposals (top-left), we train both object and part detectors based on deep convolutional features. At test time, all the windows are scored by all detectors (middle), and we apply non-parametric geometric constraints (bottom) to rescore the windows and choose the best object and part detections (top-right). The final step is to extract features on the localized semantic parts to form a pose-normalized representation, and then train a classifier for the final fine-grained categorization. Best viewed in color.
  • Figure 2: Illustration of the geometric constraint $\delta^{NP}$. In each row, the first column is the test image with an R-CNN bounding box detection, and the remaining columns are its top-five nearest neighbors in the training set, indexed using pool5 features and the cosine distance metric.
  • Figure 3: Cross-validation results on fine-grained accuracy for different values of $\alpha$ (left) and $K$ (right). We split the training data into 5 folds and cross-validate each hyperparameter setting.
  • Figure 4: Examples of bird detection and part localization from strong DPM [Hossein_ECCV12] (left); our method using $\Delta_{\mathrm{box}}$ part predictions (middle); and our method using $\delta^{NP}$ (right). All detection and localization results are obtained without assuming a bounding box at test time.
  • Figure 5: Failure cases of our part localization using $\delta^{NP}$.
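
The nearest-neighbor lookup mentioned in the Figure 2 caption (top-five training neighbors under cosine distance on pool5 features) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `top_k_cosine_neighbors`, the toy 3-dimensional vectors standing in for pool5 features, and the use of NumPy are all assumptions made here for clarity.

```python
import numpy as np


def top_k_cosine_neighbors(query, train_feats, k=5):
    """Return (indices, distances) of the k training rows nearest to `query`.

    Hypothetical helper: `query` is a 1-D feature vector and `train_feats`
    is a 2-D array with one training feature vector per row. Distance is
    cosine distance, i.e. 1 - cosine similarity.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    cosine_distance = 1.0 - t @ q
    # Stable sort keeps the original order among exact ties.
    order = np.argsort(cosine_distance, kind="stable")[:k]
    return order, cosine_distance[order]


# Toy stand-ins for pool5 features (assumed values, for illustration only).
train_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
    [-1.0, 0.0, 0.0],
])
query = np.array([2.0, 0.0, 0.0])

idx, dist = top_k_cosine_neighbors(query, train_feats, k=3)
print(idx, dist)
```

Because the vectors are normalized first, the scale of the query does not affect the ranking; only its direction does. In a real setting the rows of `train_feats` would be features extracted from training-set detections, and the retrieved neighbors would supply the part locations from which a non-parametric prior is built.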