Part-based R-CNNs for Fine-grained Category Detection
Ning Zhang, Jeff Donahue, Ross Girshick, Trevor Darrell
TL;DR
The paper tackles fine-grained categorization by localizing object parts and modeling their geometry to normalize for pose. It introduces Part-based R-CNNs, which learn whole-object and semantic part detectors over bottom-up region proposals and enforce geometric constraints between them to produce a pose-normalized representation for classification. The approach achieves state-of-the-art results on Caltech-UCSD Birds (CUB-200-2011) even without test-time bounding boxes, aided by CNN features and task-specific fine-tuning. Ablations and analysis show the benefit of non-parametric geometric priors and reveal sensitivity to hyperparameters, pointing to future work on jointly learning parts and categories and on weakly supervised part discovery.
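To make "enforce geometric constraints" concrete, here is a minimal sketch of how one object box and one box per part might be chosen jointly from scored region proposals. It assumes detector scores already lie in [0, 1]. It also assumes a box constraint that lets a part spill past the object box by at most `eps` pixels, plus a per-part geometric prior over the part's center relative to the object box. The kernel-density prior, the values of `eps` and `bw`, and every function name here are hypothetical stand-ins. They are not necessarily the paper's exact formulation.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Scored = Tuple[Box, float]               # (box, detector score in [0, 1]), assumed


def within(outer: Box, inner: Box, eps: float) -> bool:
    """Box constraint: `inner` may spill past `outer` by at most `eps` pixels per side."""
    return (inner[0] >= outer[0] - eps and inner[1] >= outer[1] - eps
            and inner[2] <= outer[2] + eps and inner[3] <= outer[3] + eps)


def rel_center(obj: Box, part: Box) -> Tuple[float, float]:
    """Part center in coordinates normalized to the object box (0..1 when inside it)."""
    w, h = obj[2] - obj[0], obj[3] - obj[1]
    cx, cy = (part[0] + part[2]) / 2, (part[1] + part[3]) / 2
    return (cx - obj[0]) / w, (cy - obj[1]) / h


def kde_prior(obj: Box, part: Box, train_rel: Sequence[Tuple[float, float]], bw: float) -> float:
    """Hypothetical non-parametric prior: unnormalized Gaussian KDE over training offsets."""
    u, v = rel_center(obj, part)
    total = sum(math.exp(-((u - a) ** 2 + (v - b) ** 2) / (2 * bw ** 2)) for a, b in train_rel)
    return total / len(train_rel)


def select(objects: List[Scored], parts: Dict[str, List[Scored]],
           train_rel: Dict[str, List[Tuple[float, float]]],
           eps: float = 10.0, bw: float = 0.1):
    """Maximize score(obj) * prod_i score(part_i) * prior_i(part_i | obj).

    The prior factorizes over parts given the object box, so once the object box is
    fixed each part is chosen independently. That costs O(#objects x total part
    candidates) prior evaluations instead of a search exponential in the part count.
    """
    best: Optional[Tuple[Box, Dict[str, Box]]] = None
    best_val = 0.0
    for obj_box, obj_score in objects:
        val, chosen = obj_score, {}
        for name, cands in parts.items():
            scored = [(s * kde_prior(obj_box, b, train_rel[name], bw), b)
                      for b, s in cands if within(obj_box, b, eps)]
            if not scored:  # no candidate for this part satisfies the box constraint
                val = 0.0
                break
            part_val, part_box = max(scored)
            val *= part_val
            chosen[name] = part_box
        if val > best_val:
            best, best_val = (obj_box, chosen), val
    return best, best_val


# Usage: the second object box has the higher raw score, but its only admissible
# "head" sits far from where training heads appear, so the first box wins.
objects = [((0, 0, 100, 100), 0.9), ((200, 200, 260, 260), 0.95)]
parts = {"head": [((10, 10, 30, 30), 0.8), ((220, 220, 230, 230), 0.9)]}
train_rel = {"head": [(0.2, 0.2)]}
best, val = select(objects, parts, train_rel)
# best == ((0, 0, 100, 100), {"head": (10, 10, 30, 30)}), val is about 0.72
```

The example shows the point of the constraints. A confident but geometrically implausible configuration can lose to a slightly weaker detection whose parts sit where training data says they should.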
Abstract
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
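The final step, predicting a category "from a pose-normalized representation," can be illustrated by concatenating features from the predicted object box and each part box in a fixed slot order, then scoring the result with a linear classifier. This is a minimal sketch under stated assumptions. The slot names `bbox`, `head`, and `body` are illustrative. `toy_extract` is a hypothetical stand-in for CNN feature extraction that returns only box width and height. `predict` is a plain one-vs-rest linear scorer, not a trained model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Hypothetical fixed slot order: each slot always holds the same semantic region,
# so the same feature dimensions describe the same part regardless of pose.
SLOTS = ("bbox", "head", "body")


def pose_normalized(boxes: Dict[str, Box], extract: Callable[[Box], List[float]]) -> List[float]:
    """Concatenate per-region features in the fixed slot order."""
    feature: List[float] = []
    for name in SLOTS:
        feature.extend(extract(boxes[name]))
    return feature


def predict(feature: Sequence[float], weights: Dict[str, Sequence[float]],
            bias: Dict[str, float]) -> str:
    """One-vs-rest linear classifier: return the class with the highest score."""
    scores = {c: sum(w * f for w, f in zip(ws, feature)) + bias[c]
              for c, ws in weights.items()}
    return max(scores, key=lambda c: scores[c])


def toy_extract(box: Box) -> List[float]:
    """Hypothetical stand-in for a CNN feature extractor: box width and height only."""
    return [float(box[2] - box[0]), float(box[3] - box[1])]


boxes = {"bbox": (0, 0, 100, 50), "head": (10, 10, 30, 30), "body": (20, 20, 80, 50)}
feature = pose_normalized(boxes, toy_extract)  # [100.0, 50.0, 20.0, 20.0, 60.0, 30.0]
weights = {"class_a": [1, 0, 0, 0, 0, 0], "class_b": [0, 1, 0, 0, 0, 0]}
label = predict(feature, weights, {"class_a": 0.0, "class_b": 0.0})  # "class_a"
```

The fixed slot order does the normalizing. Whether a bird faces left or right, the head features always occupy the same dimensions, so a single linear classifier can compare head appearance across images directly.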
