Bird Species Categorization Using Pose Normalized Deep Convolutional Nets

Steve Branson, Grant Van Horn, Serge Belongie, Pietro Perona

TL;DR

The paper tackles fine-grained bird species recognition by extracting pose-normalized image regions, guided by learned pose prototypes, and combining CNN features from multiple layers. It shows that similarity-based warping, prototype learning via a facility-location formulation, and fine-tuned CNN features drawn from multiple layers together yield substantial accuracy gains on CUB-200-2011. Key contributions include a general pose-normalization framework, an efficient prototype-learning method, and empirical analyses of which CNN layers pair best with which alignment schemes. The approach reaches about 75% correct classification, compared with 55-65% for prior methods, underscoring the value of combining pose-aware representations with deep feature learning for fine-grained classification.

Abstract

We propose an architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species. Our architecture first computes an estimate of the object's pose; this is used to compute local image features which are, in turn, used for classification. The features are computed by applying deep convolutional nets to image patches that are located and normalized by the pose. We perform an empirical study of a number of pose normalization schemes, including an investigation of higher order geometric warping functions. We propose a novel graph-based clustering algorithm for learning a compact pose normalization space. We perform a detailed investigation of state-of-the-art deep convolutional feature implementations and fine-tuning feature learning for fine-grained classification. We observe that a model that integrates lower-level feature layers with pose-normalized extraction routines and higher-level feature layers with unaligned image features works best. Our experiments advance state-of-the-art performance on bird species recognition, with a large improvement of correct classification rates over previous methods (75% vs. 55-65%).
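The prototype-learning step is described above only as a graph-based clustering algorithm with a facility-location formulation. As a rough illustration of what such an objective looks like, the sketch below greedily picks `k` prototypes so that the summed distance from every training example to its nearest prototype is small. The pairwise distance matrix `D`, the function name `greedy_facility_location`, and the use of a fixed `k` with a plain greedy heuristic are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def greedy_facility_location(D, k):
    """Greedily choose k prototype indices from a pairwise distance matrix.

    D[i, j] is a hypothetical pose distance between training examples i and j.
    The objective is sum_i min_{j in chosen} D[i, j]; each greedy step adds the
    candidate that lowers this total the most.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    nearest = np.full(n, np.inf)  # distance from each example to its closest prototype so far
    chosen = []
    for _ in range(k):
        # Total cost if candidate j were added to the current prototype set.
        totals = np.minimum(nearest[:, None], D).sum(axis=0)
        if chosen:
            totals[chosen] = np.inf  # never pick the same prototype twice
        j = int(np.argmin(totals))
        chosen.append(j)
        nearest = np.minimum(nearest, D[:, j])
    return chosen, float(nearest.sum())


if __name__ == "__main__":
    # Toy 1-D "poses" forming two well-separated groups.
    poses = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
    D = np.abs(poses[:, None] - poses[None, :])
    prototypes, cost = greedy_facility_location(D, k=2)
    print(prototypes, cost)
```

On data with clear groups, the greedy choice tends to place one prototype in each group, which is the intuition behind learning a compact set of pose prototypes.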

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Pipeline Overview: Given a test image, we use groups of detected keypoints to compute multiple warped image regions that are aligned with prototypical models. Each region is fed through a deep convolutional network, and features are extracted from multiple layers. Features are concatenated and fed to a classifier.
  • Figure 2: Example Warped Regions: The top row visualizes different prototypes, each of which defines a region of interest and multiple keypoints that are used to estimate a warping. The bottom rows show the resulting warped regions $X(w_{tp}^*)$ when 5 images are aligned with each prototype. The 4 groupings of warped regions represent 4 baseline experiments analyzed in Table \ref{tab:combine}, which include 1) Hand-Defined head or body regions, 2) the first 3 prototypes learned using our method from Section \ref{sec:pose_learn}, 3) Rand-Pairs, which simulates \cite{bergpoof}, 4) CUB-Keypoints, which simulates \cite{branson2014ignorant}. In general, we see that a similarity transform captures the scale/orientation of an object better than a translation, while an affine transformation sometimes overly distorts the image. Using more points to estimate the warping allows for non-visible keypoints and ambiguous image flipping problems to be handled consistently.
  • Figure 3: Effect of features and region type on CUB-200-2011: (a) CNN features significantly outperform HOG and Fisher features for all levels of alignment (image, bounding box, head). (b) Comparing classification performance for different CNN layers and regions if we assume ground truth part locations are known at test time, we see that 1) features extracted from the head (yellow tube) significantly outperform other regions, 2) the later fully connected layers (fc6 & fc7) significantly outperform earlier layers when a crude alignment model is used (image-level alignment), whereas convolutional layers (conv5) begin to dominate performance as we move to a stronger alignment model (from image $\to$ bbox $\to$ body $\to$ head), 3) using a similarity warping model significantly outperforms a translation model (width of the red and yellow tubes), and slightly outperforms an affine model, 4) using more points (from 1 to 5) to estimate the warping improves performance for the body, whereas 2 points is sufficient for the head.
  • Figure 4: Effect of fine-tuning and ground truth parts on CUB-200-2011: (a) If ground truth parts were available at test time or part detection could be improved, performance would be improved significantly (width of red/yellow tubes). (b) Fine-tuning significantly improves performance for all alignment levels (width of each tube). Improvements occur for all CNN layers; however, the effect is largest for fully connected layers. (c) The same effect holds for automated part prediction.
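
The captions for Figures 2 and 3 compare translation, similarity, and affine warps estimated from one or more keypoints. As a minimal sketch of the similarity case, the code below fits a least-squares similarity transform (scale, rotation, translation, no reflection) that maps detected keypoints onto a prototype's keypoints, and contrasts it with a translation-only fit. The function names `fit_similarity`, `fit_translation`, and `apply_transform`, and the toy coordinates, are hypothetical; the paper's own estimation procedure may differ, for example in how non-visible keypoints or image flips are handled.

```python
import numpy as np


def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src keypoints onto dst.

    src, dst: (N, 2) arrays of corresponding points, N >= 2 and not all identical.
    Returns a 2x3 matrix M with dst ~= src @ M[:, :2].T + M[:, 2].
    Points are treated as complex numbers, so the model is q = a * p + b.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    p = src[:, 0] + 1j * src[:, 1]
    q = dst[:, 0] + 1j * dst[:, 1]
    pc, qc = p - p.mean(), q - q.mean()
    a = np.sum(np.conj(pc) * qc) / np.sum(np.abs(pc) ** 2)  # scale and rotation
    b = q.mean() - a * p.mean()                              # translation
    return np.array([[a.real, -a.imag, b.real],
                     [a.imag, a.real, b.imag]])


def fit_translation(src, dst):
    """Translation-only fit: shift by the mean displacement between the point sets."""
    t = (np.asarray(dst, dtype=float) - np.asarray(src, dtype=float)).mean(axis=0)
    return np.array([[1.0, 0.0, t[0]],
                     [0.0, 1.0, t[1]]])


def apply_transform(M, pts):
    """Apply a 2x3 transform matrix to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]


if __name__ == "__main__":
    # Toy example: detected keypoints are a scaled, rotated, shifted copy of the prototype.
    prototype = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    theta, scale = np.pi / 6, 2.0
    R = scale * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta), np.cos(theta)]])
    detected = prototype @ R.T + np.array([5.0, -3.0])

    for name, fit in [("similarity", fit_similarity), ("translation", fit_translation)]:
        M = fit(detected, prototype)
        residual = np.linalg.norm(apply_transform(M, detected) - prototype)
        print(name, residual)
```

A translation can only match the centroids of the two point sets, so scale and orientation differences remain as residual error. That is consistent with the caption's observation that a similarity transform captures the scale and orientation of the object better than a translation.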