A Dataset and Framework for Learning State-invariant Object Representations

Rohan Sarkar; Avinash Kak

TL;DR

The paper introduces the ObjectsWithStateChange (OWSC) dataset for studying state-invariant object representations under state and pose variations captured from arbitrary viewpoints. It extends the PiRO framework with a curriculum-based informative-sample mining strategy and a dual-encoder architecture that jointly learns category-level and object-level embeddings that remain discriminative for fine-grained retrieval. Empirical results show state-of-the-art performance on OWSC and improved generalization to other multi-view datasets, which is attributed to curriculum-based sampling and deeper attention-based aggregation. The dataset includes 21 categories, 331 objects, and 11,328 images with rich state-change annotations and text descriptions, enabling eight state-invariant tasks and potential multimodal applications.

Abstract

We add one more invariance, state invariance, to the invariances more commonly used when learning object representations for recognition and retrieval. By state invariance, we mean robustness to changes in the structural form of an object, such as when an umbrella is folded or an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. We believe that this dataset will facilitate research in fine-grained recognition and retrieval of 3D objects that are capable of state changes. The goal of such research would be to train models capable of learning discriminative object embeddings that remain invariant to state changes while also staying invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge in this regard is that instances of different objects (both within and across categories) under various state changes may share similar visual characteristics and may therefore lie close to one another in the learned embedding space, which makes them harder to discriminate. To address this, we propose a curriculum learning strategy that progressively selects object pairs with smaller inter-object distances in the learned embedding space during training. This approach gradually samples harder-to-distinguish examples of visually similar objects, both within and across categories. Our ablation on the role of curriculum learning indicates an improvement in object recognition accuracy of 7.9% and retrieval mAP of 9.2% over the state of the art on our new dataset, as well as on three other challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D.
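
The curriculum idea in the abstract, selecting object pairs with progressively smaller inter-object distances, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mine_hard_object_pairs` and `curriculum_schedule`, the use of per-object centroid distances, and the linear schedule are all assumptions made here for illustration.

```python
import numpy as np


def mine_hard_object_pairs(embeddings, object_ids, keep_fraction):
    """Return the closest object pairs in embedding space (hypothetical helper).

    embeddings:    (N, D) array of image embeddings.
    object_ids:    (N,) array of integer object labels.
    keep_fraction: fraction of all object pairs to keep, closest first.
    """
    ids = np.unique(object_ids)
    # Assumption: each object is summarized by the mean of its image embeddings.
    centroids = np.stack([embeddings[object_ids == i].mean(axis=0) for i in ids])
    # Pairwise Euclidean distances between object centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Consider each unordered pair of distinct objects once.
    iu, ju = np.triu_indices(len(ids), k=1)
    pair_dist = dist[iu, ju]
    n_keep = max(1, int(np.ceil(keep_fraction * len(pair_dist))))
    closest = np.argsort(pair_dist, kind="stable")[:n_keep]
    return [(int(ids[iu[k]]), int(ids[ju[k]])) for k in closest]


def curriculum_schedule(epoch, num_epochs, start=1.0, end=0.1):
    """Linearly shrink the kept fraction so later epochs see only harder pairs.

    The linear form and the start/end values are illustrative assumptions.
    """
    t = epoch / max(1, num_epochs - 1)
    return start + t * (end - start)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(40, 8))      # toy embeddings
    ids = np.repeat(np.arange(8), 5)    # 8 objects, 5 images each
    for epoch in range(3):
        frac = curriculum_schedule(epoch, num_epochs=3)
        pairs = mine_hard_object_pairs(emb, ids, frac)
        print(f"epoch {epoch}: keeping {len(pairs)} object pairs")
```

In a training loop, the selected pairs would be used to build batches for the ranking losses, with the embeddings recomputed periodically so that the notion of "hard" tracks the current model.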

Paper Structure (22 sections, 2 equations, 12 figures, 6 tables)

Introduction
Related Work
The ObjectsWithStateChange Dataset
Dataset Design and Collection
Benchmarking using ObjectsWithStateChange
Proposed Framework for Learning Invariant Embeddings
The Dual-Encoder Architecture
Losses
Using Curriculum Learning as a Mining Strategy
Informative Sample Mining and Training
Implementation Details
Experimental Results
Evaluation using ObjectsWithStateChange
Ablation Studies
Conclusion
...and 7 more sections

Figures (12)

Figure 1: In addition to pose, many commonly occurring objects also exhibit significant changes in their appearance when their state changes. This figure displays several examples of such objects in our ObjectsWithStateChange (OWSC) dataset.
Figure 2: Figs. (a)-(e) show a few examples of visually similar objects from different categories in our dataset for fine-grained recognition and retrieval tasks in the top three rows. Fig. (f) shows samples of pairs of visually similar objects from the same category under various state and pose changes captured from arbitrary views in the bottom two rows.
Figure 3: This figure shows samples from the Train and Test splits of our dataset for different objects. The state of the object as well as the background and pose are different in each split for every object. The images are captured from arbitrary viewpoints.
Figure 4: This figure shows the images in various states and poses captured from arbitrary viewpoints, category label, object label, and text description for each object of the OWSC dataset.
Figure 5: An overview of the ranking losses from PiRO, described in the Losses section, for the object space (top) and the category space (bottom) that we use in our work to learn invariant embeddings.
...and 7 more figures
