Unseen Object Reasoning with Shared Appearance Cues

Paridhi Singh; Arun Kumar

TL;DR

The paper tackles open world recognition by representing objects as constellations of mid-level appearance cues derived from patch-level ViT features. A semantic prior over clusters, $S \in \mathbb{R}^{G \times K}$, is learned from known classes and used to compute a semantic vector for a test image, $P_{I^t} = \frac{1}{M^2} \sum_m S(D_k^m)$, which supports inferring how similar an unseen object is to known categories and their superclasses. The authors report that a finite set of mid-level cues suffices to model both seen and unseen objects on CIFAR100 and ImageNet 64×64, with superclass reasoning performed without full supervision. The approach targets open-world settings in which novel objects must be reasoned about without exhaustively labeling all categories.
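The averaging step in the formula for $P_{I^t}$ can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in the summary: that $G$ indexes known classes and $K$ indexes appearance clusters, that $S(D_k^m)$ denotes the column of $S$ for the cluster $k$ assigned to patch $m$, and that the image is split into an $M \times M$ grid of patches. The function name `semantic_vector` and the toy values are hypothetical and are not taken from the paper.

```python
import numpy as np


def semantic_vector(S, patch_clusters):
    """Average the prior columns of the clusters assigned to an image's patches.

    S: (G, K) array; S[g, k] is the prior weight of known class g for cluster k
       (assumed interpretation of the semantic prior).
    patch_clusters: (M, M) integer array of cluster indices in [0, K).
    Returns a length-G vector P = (1 / M^2) * sum over patches of S[:, cluster].
    """
    S = np.asarray(S, dtype=float)
    idx = np.asarray(patch_clusters).ravel()  # M^2 cluster indices
    return S[:, idx].mean(axis=1)             # mean over the M^2 patches


# Toy example (made-up numbers): G = 3 known classes, K = 4 clusters.
S = np.array([
    [0.7, 0.1, 0.2, 0.0],
    [0.2, 0.8, 0.3, 0.5],
    [0.1, 0.1, 0.5, 0.5],
])
# A 2x2 patch grid (M = 2) with each patch assigned to one cluster.
patch_clusters = np.array([[0, 0],
                           [2, 3]])

P = semantic_vector(S, patch_clusters)
print(P)
```

Because each column of the toy `S` sums to 1, the resulting vector also sums to 1 and can be read as a soft similarity of the test image to each known class. Whether the paper normalizes $S$ this way is not stated in the summary.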

Abstract

This paper introduces an innovative approach to open world recognition (OWR), where we leverage knowledge acquired from known objects to address the recognition of previously unseen objects. The traditional method of object modeling relies on supervised learning with strict closed-set assumptions, presupposing that objects encountered during inference are already known at the training phase. However, this assumption proves inadequate for real-world scenarios due to the impracticality of accounting for the immense diversity of objects. Our hypothesis posits that object appearances can be represented as collections of "shareable" mid-level features, arranged in constellations to form object instances. By adopting this framework, we can efficiently dissect and represent both known and unknown objects in terms of their appearance cues. Our paper introduces a straightforward yet elegant method for modeling novel or unseen objects, utilizing established appearance cues and accounting for inherent uncertainties. This representation not only enables the detection of out-of-distribution objects or novel categories among unseen objects but also facilitates a deeper level of reasoning, empowering the identification of the superclass to which an unknown instance belongs. This novel approach holds promise for advancing open world recognition in diverse applications.

Paper Structure (11 sections, 3 equations, 6 figures, 2 tables)

Introduction
Related Work
Model
Fix Notations
Dataset Representation
Appearance Based Grouping
Inference
Datasets
Implementation Details
Results and Discussion
Conclusion & Future Work

Figures (6)

Figure 1: If human brains can successfully reason about novel objects, why let the networks fail?
Figure 2: Visualization of our appearance + positional embedded clustering: Each block comprises patches that belong to an appearance cluster. CIFAR100 with 112x112 as patch size
Figure 3: t-SNE feature plots for randomly generated data points from CIFAR100. Left: clustered data points; right: semantic labels of the data points. Each data point is the appearance vector of one patch. For the same number of clusters, grouping by semantic labels (right) shows markedly higher entropy than appearance-based clustering alone.
Figure 4: Optimal K via the elbow method for the CIFAR100 and ImageNet datasets
Figure 5: CIFAR100: Number of clusters vs top-1 and top-2 accuracies
...and 1 more figure
