A Dataset and Framework for Learning State-invariant Object Representations
Rohan Sarkar, Avinash Kak
TL;DR
The paper introduces the ObjectsWithStateChange (OWSC) dataset to study state-invariant object representations under state and pose variations from arbitrary viewpoints. It extends the PiRO framework with a curriculum-based informative-sample mining strategy and a dual-encoder architecture to jointly learn category- and object-level embeddings that remain discriminative for fine-grained retrieval. Empirical results show state-of-the-art performance on OWSC and improved generalization to other multi-view datasets, attributed to curriculum-based sampling and deeper attention-based aggregation. The dataset includes 21 categories, 331 objects, and 11,328 images with rich state-change annotations and text descriptions, enabling eight state-invariant tasks and potential multimodal applications.
Abstract
We add one more invariance, state invariance, to the invariances more commonly used for learning object representations for recognition and retrieval. By state invariance, we mean robustness with respect to changes in the structural form of an object, such as when an umbrella is folded or when an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. We believe that this dataset will facilitate research in fine-grained recognition and retrieval of 3D objects that are capable of state changes. The goal of such research would be to train models that learn discriminative object embeddings that remain invariant to state changes while also staying invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge in this regard is that instances of different objects (both within and across categories) under various state changes may share similar visual characteristics and may therefore lie close to one another in the learned embedding space, which makes them harder to discriminate. To address this, we propose a curriculum learning strategy that, during training, progressively selects object pairs with smaller inter-object distances in the learned embedding space. This approach gradually samples harder-to-distinguish examples of visually similar objects, both within and across categories. Our ablation on the role of curriculum learning indicates improvements of 7.9% in object recognition accuracy and 9.2% in retrieval mAP over the state of the art on our new dataset, as well as on three other challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D.
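To make the curriculum idea concrete, the following is a minimal sketch of sampling object pairs whose inter-object distance shrinks as training progresses. It is not the authors' implementation. The abstract does not specify the schedule or the distance measure, so several pieces here are assumptions: one embedding per object, Euclidean distance, a linear schedule, and the hypothetical names `nearest_first`, `pool_size`, `sample_partners`, `progress`, and `k_min`. In practice the embeddings would presumably be recomputed from the current model as it trains.

```python
import math
import random

def nearest_first(embeddings, i):
    """Indices of all other objects, sorted by distance to object i."""
    others = [j for j in range(len(embeddings)) if j != i]
    # Assumption: Euclidean distance between per-object embeddings.
    return sorted(others, key=lambda j: math.dist(embeddings[i], embeddings[j]))

def pool_size(n_objects, progress, k_min=1):
    """Candidate-pool size: n_objects - 1 at progress 0, k_min at progress 1."""
    k_max = n_objects - 1
    # Assumption: the pool shrinks linearly with training progress.
    k = round(k_max - progress * (k_max - k_min))
    return max(k_min, min(k_max, k))

def sample_partners(embeddings, progress, rng, k_min=1):
    """Pair each object with one of its k nearest other objects."""
    k = pool_size(len(embeddings), progress, k_min)
    return [rng.choice(nearest_first(embeddings, i)[:k])
            for i in range(len(embeddings))]

# Toy 1-D embeddings for four objects (illustrative values only).
emb = [(0.0,), (1.0,), (3.0,), (10.0,)]
rng = random.Random(0)
print(sample_partners(emb, 0.0, rng))  # early: any other object may be drawn
print(sample_partners(emb, 1.0, rng))  # late: nearest only, prints [1, 0, 1, 2]
```

Early in training, every other object is a candidate partner, so most sampled pairs are easy to tell apart. As `progress` approaches 1, only the nearest objects remain in the pool, so the sampled pairs are the visually similar, harder-to-distinguish ones the abstract describes.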
