Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization

Jiayun Wang; Yubei Chen; Stella X. Yu

TL;DR

The paper aims to make self-supervised visual representations aware of both object identity and how objects are presented. It introduces unlabeled image triplets sampled along viewpoint trajectories and a new SSL benchmark that measures semantic classification and pose estimation on the same feature. It shows that mid-layer features are more informative for geometry tasks and introduces a trajectory regularization loss that enforces smooth pose trajectories in feature space, improving pose estimation without sacrificing semantic accuracy. Empirically, the approach improves pose estimation on in-domain and out-of-domain data and generalizes better to novel objects and poses, with further gains when mid-layer representations are used and compressed. Overall, the work provides a dataset, a practical loss, and evidence that geometry-aware SSL can enhance pose understanding while maintaining semantic richness, with potential impact on robotics, video analysis, and real-world perception systems.

Abstract

Learning visual features from unlabeled images has proven successful for semantic categorization, often by mapping different views of the same object to the same feature to achieve recognition invariance. However, visual recognition involves not only identifying what an object is but also understanding how it is presented. For example, seeing a car from the side versus head-on is crucial for deciding whether to stay put or jump out of the way. While unsupervised feature learning for downstream viewpoint reasoning is important, it remains under-explored, partly due to the lack of a standardized evaluation method and benchmarks. We introduce a new dataset of adjacent image triplets obtained from a viewpoint trajectory, without any semantic or pose labels. We benchmark both semantic classification and pose estimation accuracies on the same visual feature. Additionally, we propose a viewpoint trajectory regularization loss for learning features from unlabeled image triplets. Our experiments demonstrate that this approach helps develop a visual representation that encodes object identity and organizes objects by their poses, retaining semantic classification accuracy while achieving emergent global pose awareness and better generalization to novel objects. Our dataset and code are available at http://pwang.pw/trajSSL/.

Paper Structure (21 sections, 4 equations, 16 figures, 9 tables)

Introduction
Related Works
A Benchmark for SSL Geometric Representations
The Problem Setting
Data and Evaluation Metrics
Enhancing Geometric Representation Learning
Mid-Layer Representation for Evaluation
Trajectory Regularization
Experiments
Training Protocols
Evaluation Protocols
Evaluation on Last Feature-Layer
Evaluation on Mid-Layer Representations
Visualizations
More Dataset Details
...and 6 more sections

Figures (16)

Figure 1: Our goal is to capture two aspects of object recognition through SSL: what the object is and how the object is presented. While the former has been well studied [chen2020simple, bardes2021vicreg], the latter is rarely addressed. We learn SSL representations that capture not only object semantics but also pose. a) The training data are image triplets with subtle viewpoint changes of objects. The object identity, semantics and pose are unknown. b) The learned representations are expected to discriminate different object semantics and poses, achieving high accuracies for both semantic classification and pose estimation. Notably, we expect global pose understanding to emerge from local pose changes. c) Our approach improves pose estimation accuracy over existing methods [bardes2021vicreg, chen2020simple, chen2020simsiam] by encouraging images with similar poses to form smooth trajectories in the representation space.
Figure 2: Our benchmark dataset contains rendered images from ShapeNet [chang2015shapenet]. Left: For semantics, we use 13 in-domain semantic categories and 11 out-of-domain categories, with no overlap between the two sets. We visualize the in-domain and out-of-domain semantic classes using PCA-projected Word2Vec embeddings [church2017word2vec] and show a representative object rendered at $(15^{\circ},15^{\circ})$. Right: For pose, we adopt absolute and relative pose estimation as tasks. Notably, relative pose makes it possible to test SSL generalization on out-of-domain data, as it eliminates the need for a category-specific canonical pose. The (camera) pose is defined as the spherical coordinate (azimuth, elevation) of the camera position. We render objects from $n$ unique camera angles, uniformly distributed across the viewing sphere $S^2$ using a Fibonacci sphere distribution [alexa2022super], denoted as $\text{Fib}(n)$ (more details in Fig. \ref{fig:relpose} in the supplementary). We use $\text{Fib}(50)$ for in-domain training and $\text{Fib}(100)$ for out-of-domain evaluations. In-domain and out-of-domain set statistics are in Table \ref{tab:dataset} in the supplementary.
Figure 3: We impose an unsupervised loss on the feature representations after feeding the image through an encoder (a). In addition to an unsupervised semantic loss $\mathcal{L}_{\text{sem}}$ (b), which is commonly used in SSL, we add a trajectory loss $\mathcal{L}_{\text{traj}}$ (Eqn. \ref{eq:traj}) (c) to enhance the geometric representation. $\mathcal{L}_{\text{sem}}$ always follows the baseline's settings; for SimCLR [chen2020simple], for example, it is applied after the projector. $\mathcal{L}_{\text{traj}}$ always operates on the pooled feature $z$. For pose evaluation, we allow representations from different layers and find that mid-layer representations such as "res block3" improve pose estimation.
Figure 4: We enforce representations of adjacent views of an object, $\mathbf{z_L},\mathbf{z_C},\mathbf{z_R}$, to form a geodesic trajectory. Upper: $\mathbf{z}$ resides on a unit hypersphere. The objective is to map the difference vectors $\mathbf{v_1} = \mathbf{z_C} - \mathbf{z_L}$ and $\mathbf{v_2} = \mathbf{z_R} - \mathbf{z_C}$ onto $\mathbf{z_C}$'s tangent plane and maximize their cosine similarity, so that the trajectory is linear on that plane. Lower: The projected vector $\mathbf{u}$ is obtained by subtracting from the difference vector $\mathbf{v}$ its component along the normal direction $\mathbf{z_C}$.
Figure 5: Our trajectory regularization consistently achieves higher relative pose estimation accuracy across in-domain and out-of-domain semantic categories and in-domain and out-of-domain poses. The bottom right figure shows the performance on the real dataset [shaler2017carvana], where accuracy is high because its pose classification setting is easier than the simulated Fib(50)/Fib(100) pose estimation. Our trajectory loss $\mathcal{L}_{\text{traj}}$ leads to pose estimation gains without harming semantic classification accuracy. Specifically, SSL gives results comparable or marginally superior to supervised methods for out-of-domain and real data. Feature-layer representation $z$ is used for pose estimation.
...and 11 more figures
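
The tangent-plane construction described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the loss form `1 - cos(u1, u2)` is an assumption, since the paper's Eqn. for $\mathcal{L}_{\text{traj}}$ is not reproduced on this page. Only the projection step and the goal of maximizing cosine similarity come from the caption.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Scale each row of x to unit length so it lies on the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def project_to_tangent(v, z_c):
    """Remove from v its component along z_c, the sphere's normal at z_c."""
    return v - np.sum(v * z_c, axis=-1, keepdims=True) * z_c


def trajectory_loss(z_l, z_c, z_r, eps=1e-12):
    """Assumed loss form: mean of 1 - cos(u1, u2) over a batch of triplets.

    z_l, z_c, z_r: arrays of shape (batch, dim) holding the features of the
    left, center, and right views of each triplet.
    """
    z_l, z_c, z_r = normalize(z_l), normalize(z_c), normalize(z_r)
    u1 = project_to_tangent(z_c - z_l, z_c)  # v1 = z_C - z_L, projected
    u2 = project_to_tangent(z_r - z_c, z_c)  # v2 = z_R - z_C, projected
    cos = np.sum(u1 * u2, axis=-1) / (
        np.linalg.norm(u1, axis=-1) * np.linalg.norm(u2, axis=-1) + eps
    )
    return float(np.mean(1.0 - cos))


# Three points on a great circle: the projected steps are parallel.
t = 0.1
z_left = np.array([[np.cos(-t), np.sin(-t), 0.0]])
z_center = np.array([[1.0, 0.0, 0.0]])
z_right = np.array([[np.cos(t), np.sin(t), 0.0]])
smooth = trajectory_loss(z_left, z_center, z_right)

# A trajectory that doubles back on itself: the projected steps are opposed.
reversed_ = trajectory_loss(z_left, z_center, z_left)
```

Under this assumed form, the loss is close to zero when the three features lie along a great circle and approaches 2 when the trajectory reverses direction at the center view.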
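
The $\text{Fib}(n)$ camera sampling mentioned in the Figure 2 caption can be illustrated with the common golden-angle Fibonacci lattice. This is a sketch under assumptions: the paper's exact construction (following [alexa2022super]) may differ in offsets or ordering, and the function names and the (azimuth, elevation) convention in radians are chosen here for illustration.

```python
import numpy as np


def fibonacci_sphere(n):
    """Return n (azimuth, elevation) pairs in radians, roughly uniform on S^2.

    Uses the golden-angle lattice: heights are evenly spaced in (-1, 1) and
    the azimuth advances by the golden angle at each step. This is an assumed
    stand-in for the paper's Fib(n), not its verified definition.
    """
    i = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    height = 1.0 - (2.0 * i + 1.0) / n          # sin(elevation), in (-1, 1)
    azimuth = (golden_angle * i) % (2.0 * np.pi)
    elevation = np.arcsin(height)
    return np.stack([azimuth, elevation], axis=1)


def to_unit_vectors(poses):
    """Convert (azimuth, elevation) pairs to 3D camera directions."""
    az, el = poses[:, 0], poses[:, 1]
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=1
    )


train_poses = fibonacci_sphere(50)    # analogous to Fib(50)
eval_poses = fibonacci_sphere(100)    # analogous to Fib(100)
train_dirs = to_unit_vectors(train_poses)
```

Because the heights are evenly spaced, each point covers approximately the same area of the sphere, which is why this kind of lattice is often used for near-uniform viewpoint sampling.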
