Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval

Rohan Sarkar, Avinash Kak

TL;DR

This paper presents an attention-based dual-encoder architecture with specially designed loss functions that simultaneously optimize the inter- and intra-class distances in two different embedding spaces, one for the category embeddings and the other for the object-level embeddings.

Abstract

In the context of pose-invariant object recognition and retrieval, we demonstrate that it is possible to achieve significant improvements in performance if both the category-based and the object-identity-based embeddings are learned simultaneously during training. In hindsight, that sounds intuitive because learning about the categories is more fundamental than learning about the individual objects that correspond to those categories. However, to the best of our knowledge, no prior work in pose-invariant learning has demonstrated this effect. This paper presents an attention-based dual-encoder architecture with specially designed loss functions that optimize the inter- and intra-class distances simultaneously in two different embedding spaces, one for the category embeddings and the other for the object-level embeddings. The proposed loss functions are pose-invariant ranking losses designed to minimize the intra-class distances and maximize the inter-class distances in the dual representation spaces. We demonstrate the power of our approach with three challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D. With our dual approach, for single-view object recognition, we outperform the previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. For single-view object retrieval, we outperform the previous best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.
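
To make the idea of shrinking intra-class distances while growing inter-class distances in two embedding spaces more concrete, here is a minimal NumPy sketch. It uses a generic margin-based triplet hinge loss as a stand-in and is not the pose-invariant ranking loss defined in the paper. The function names, the `margin` value, and the simple summation of the two terms are all assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distance between every pair of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def margin_ranking_loss(emb, labels, margin=0.5):
    """Generic triplet hinge loss (illustrative stand-in, not the paper's loss).

    For each anchor, a same-label sample should be closer than a
    different-label sample by at least `margin` (assumed hyperparameter).
    """
    d = pairwise_sq_dists(emb)
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # same label, excluding self
    neg = ~same                                    # different label
    # triplet[a, p, n] = d[a, p] - d[a, n] + margin
    triplet = d[:, :, None] - d[:, None, :] + margin
    valid = pos[:, :, None] & neg[:, None, :]
    if not valid.any():
        return 0.0
    return float(np.maximum(triplet, 0.0)[valid].mean())

def dual_loss(cat_emb, obj_emb, cat_labels, obj_labels, margin=0.5):
    """Sum of the same loss applied in the category and object spaces.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    return (margin_ranking_loss(cat_emb, cat_labels, margin)
            + margin_ranking_loss(obj_emb, obj_labels, margin))

# Toy batch: 2 categories x 2 objects x 2 views = 8 single-view embeddings.
rng = np.random.default_rng(0)
cat_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
obj_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
cat_emb = rng.normal(size=(8, 16))  # stand-in for category-space outputs
obj_emb = rng.normal(size=(8, 16))  # stand-in for object-space outputs
print(dual_loss(cat_emb, obj_emb, cat_labels, obj_labels))
```

In this sketch the category labels group all views of all objects in a category, while the object labels group only the views of a single object, so the same loss pulls together different things in each space.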

Paper Structure (13 sections, 10 equations, 19 figures, 6 tables)

This paper contains 13 sections, 10 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: The upper panel shows objects belonging to two different categories, chair and stool. In the proposed disentangled dual-space learning, the goal for the learning of category-based embeddings is to capture what maximally discriminates the objects belonging to the two categories: the presence or the absence of the back-rest. On the other hand, the object-identity-based embeddings are meant to capture what is distinctive about each object. The lower panel illustrates our dual-space approach for simultaneously learning the embeddings in two different spaces for category and object-identity-based recognition and retrieval tasks.
  • Figure 2: An overview of our PiRO framework to learn the dual pose-invariant object and category embeddings using losses specifically designed for each embedding space. Multi-view images of two randomly chosen objects from the same category are used to learn common characteristics of the objects in the category embedding space and discriminatory attributes to distinguish between them in the object embedding space.
  • Figure 3: The Pose-invariant Attention Network (PAN) takes a set of multi-view images of an object as input, producing both single-view and multi-view embeddings for each representational subspace. The object embeddings are depicted in orange, while the category embeddings are in blue.
  • Figure 4: The pose-invariant losses enhance intra-class compactness and inter-class separation in the dual embedding spaces. In the object embedding space (top), confusing instances of two different objects from the same category are separated. In the category embedding space (bottom), objects belonging to the same category are pulled closer while being separated from those belonging to other categories.
  • Figure 5: We show UMAP visualizations for a qualitative comparison of the object embedding space learned for the ModelNet40 test dataset (100 objects from 5 categories: table, desk, chair, stool, and sofa) by prior pose-invariant methods (PIE2019) and by our method. Each instance is an object view, and a unique color and shape denote each object-identity class in the visualizations.
  • ...and 14 more figures
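
The Figure 3 caption describes a network that turns a set of views into single-view and multi-view embeddings in each of the two spaces. The NumPy sketch below shows one way such outputs could be shaped. It is not the PAN architecture itself: the backbone features are assumed to be given, and the two linear heads (`w_cat`, `w_obj`), the plain dot-product self-attention over views, and the mean pooling are hypothetical choices made only to illustrate the input/output structure.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dual_embeddings(view_feats, w_cat, w_obj):
    """Illustrative dual-head embedding of one object's views.

    view_feats: (V, D) per-view backbone features (assumed precomputed).
    w_cat, w_obj: (D, E) projections for the two hypothetical heads.
    Returns, per space, single-view embeddings (V, E) and one
    multi-view embedding (E,).
    """
    out = {}
    for name, w in (("category", w_cat), ("object", w_obj)):
        single = view_feats @ w                        # (V, E)
        scores = single @ single.T / np.sqrt(single.shape[1])
        attended = softmax(scores, axis=-1) @ single   # each view attends to all views
        multi = attended.mean(axis=0)                  # pool views into one vector
        out[name] = {
            "single_view": l2_normalize(single),
            "multi_view": l2_normalize(multi),
        }
    return out

# Toy input: 12 views of one object with 32-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 32))
w_cat = rng.normal(size=(32, 8))
w_obj = rng.normal(size=(32, 8))
emb = dual_embeddings(feats, w_cat, w_obj)
print(emb["category"]["single_view"].shape, emb["object"]["multi_view"].shape)
```

Because attention and mean pooling treat the views as an unordered set, the multi-view embedding in this sketch does not depend on the order in which the views are supplied.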