Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
Jiayun Wang, Yubei Chen, Stella X. Yu
TL;DR
The paper aims to make self-supervised visual representations aware of both what an object is and how it is presented. It introduces a dataset of unlabeled image triplets sampled along viewpoint trajectories, together with an SSL benchmark that evaluates semantic classification and pose estimation on the same feature. It shows that mid-layer features are more informative for geometry tasks and proposes a trajectory regularization loss that enforces smooth pose trajectories in feature space, improving pose estimation without sacrificing semantic accuracy. Empirically, the approach improves pose estimation on in-domain and out-of-domain data and generalizes better to novel objects and poses, with further gains when mid-layer representations are used and compressed. Overall, the work contributes a dataset, a practical loss, and evidence that geometry-aware SSL can improve pose understanding while preserving semantic richness, with potential applications in robotics, video analysis, and real-world perception systems.
Abstract
Learning visual features from unlabeled images has proven successful for semantic categorization, often by mapping different views of the same object to the same feature to achieve recognition invariance. However, visual recognition involves not only identifying what an object is but also understanding how it is presented. For example, seeing a car from the side versus head-on is crucial for deciding whether to stay put or jump out of the way. While unsupervised feature learning for downstream viewpoint reasoning is important, it remains under-explored, partly due to the lack of a standardized evaluation method and benchmarks. We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory, without any semantic or pose labels. We benchmark both semantic classification and pose estimation accuracies on the same visual feature. Additionally, we propose a viewpoint trajectory regularization loss for learning features from unlabeled image triplets. Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity and organizes objects by their poses, retaining semantic classification accuracy while achieving emergent global pose awareness and better generalization to novel objects. Our dataset and code are available at http://pwang.pw/trajSSL/.
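
To make the idea of a trajectory regularizer on image triplets concrete, the following is a minimal sketch of one plausible smoothness penalty, not necessarily the loss used in the paper. It assumes each triplet has already been encoded into three feature vectors (left, center, right along the viewpoint trajectory) and penalizes changes of direction between the two consecutive feature-space steps. The function name trajectory_smoothness_loss, its arguments, and the choice of 1 minus cosine similarity are illustrative assumptions.

```python
import math


def trajectory_smoothness_loss(z_left, z_center, z_right, eps=1e-8):
    """Hypothetical smoothness penalty over a batch of feature triplets.

    Each argument is a non-empty list of feature vectors (lists of floats),
    one per triplet. For every triplet the loss is 1 - cos(angle) between the
    step left->center and the step center->right, so features that move along
    a straight line score about 0 and features that double back score about 2.
    This is an assumed form for illustration, not the paper's definition.
    """
    total = 0.0
    for left, center, right in zip(z_left, z_center, z_right):
        # Consecutive displacement vectors along the trajectory in feature space.
        step_in = [c - l for c, l in zip(center, left)]
        step_out = [r - c for r, c in zip(right, center)]

        dot = sum(a * b for a, b in zip(step_in, step_out))
        norm_in = math.sqrt(sum(a * a for a in step_in))
        norm_out = math.sqrt(sum(b * b for b in step_out))

        # eps guards against division by zero when a step has zero length.
        cosine = dot / (norm_in * norm_out + eps)
        total += 1.0 - cosine
    return total / len(z_left)


# Toy 2-D features: a straight trajectory, a right-angle turn, and a reversal.
straight = trajectory_smoothness_loss([[0.0, 0.0]], [[1.0, 0.0]], [[2.0, 0.0]])
turn = trajectory_smoothness_loss([[0.0, 0.0]], [[1.0, 0.0]], [[1.0, 1.0]])
reversal = trajectory_smoothness_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.0]])
```

In a training setup, a term of this kind would typically be added, with a weighting coefficient, to a standard invariance-based SSL objective; the weighting and the layer at which features are taken are design choices this sketch leaves open.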
