Enhancing 2D Representation Learning with a 3D Prior

Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan

TL;DR

This work addresses the brittleness of 2D self-supervised representations by injecting a strong 3D prior through a single-view proxy task. Starting from a pre-trained 2D SSL backbone, the method trains a triplane-based 3D decoder to perform volume rendering-based reconstruction of both RGB images and pseudo depth maps, guided by a distillation loss to preserve existing 2D features. The approach yields consistent robustness gains across multiple datasets (e.g., ImageNet-Rendition, ImageNet-Sketch, PUG) and enhances shape bias, with only modest or neutral impact on standard downstream tasks like ImageNet classification and depth estimation. These results demonstrate that incorporating 3D structure from monocular imagery can produce more robust, shape-aware representations without requiring multi-view data or explicit 3D supervision, offering a data-efficient path to stronger visual representations.
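The reconstruction objective summarized above can be illustrated with a minimal, pure-Python sketch. It composites samples along one camera ray into a colour and an expected depth using the standard volume-rendering quadrature, then sums three loss terms. The function names (`render_ray`, `total_loss`), the squared-error and absolute-error forms, and the weights `w_depth` and `w_dist` are assumptions made for illustration; the summary does not state the exact loss definitions or weighting.

```python
import math


def render_ray(sigmas, colors, ts, deltas):
    """Composite samples along one ray into (rgb, expected depth).

    sigmas: densities per sample, colors: [r, g, b] per sample,
    ts: distance of each sample from the camera, deltas: sample spacing.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    for sigma, color, t, delta in zip(sigmas, colors, ts, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        depth += weight * t
        transmittance *= 1.0 - alpha
    return rgb, depth


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(rgb_rec, rgb_target, depth_rec, depth_target,
               feat, feat_frozen, w_depth=1.0, w_dist=1.0):
    """Hypothetical combination of RGB, depth, and distillation terms."""
    l_rgb = mse(rgb_rec, rgb_target)          # assumed squared-error form
    l_depth = abs(depth_rec - depth_target)   # assumed absolute-error form
    l_dist = mse(feat, feat_frozen)           # keeps features near the frozen 2D model
    return l_rgb + w_depth * l_depth + w_dist * l_dist


if __name__ == "__main__":
    rgb, depth = render_ray(
        sigmas=[0.5, 2.0],
        colors=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
        ts=[1.0, 2.0],
        deltas=[1.0, 1.0],
    )
    print(rgb, depth)
```

In a real system these quantities would be tensors rendered for every pixel from the triplane features; the scalar version only shows how one set of densities yields both the colour and the depth estimate.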

Abstract

Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models on labeled data, which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans, who obtain rich 3D information from binocular vision and motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy because shape-centric visual processing has been shown to be more robust than texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior in the model during training. Through experiments across a range of datasets, we demonstrate that our 3D-aware representations are more robust than those of conventional self-supervised baselines.

Paper Structure (14 sections, 4 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Humans have no difficulty in recognizing the categories depicted in the above images, even though the texture of the objects has been perturbed. This is thought to be in large part due to our reliance on shape, as opposed to texture, cues [landau1988importance, spelke2007core, geirhos2018imagenet]. However, an automated recognition system built on top of a state-of-the-art self-supervised representation learning approach (i.e., DINOv2 [oquab2023dinov2]) classifies these examples as dog, chair, and knife respectively, because the texture of the images resembles those object classes. We introduce a new approach to improve the robustness of self-supervised methods using a proxy 3D reconstruction task, which encourages representations that place more emphasis on shape cues. As a result, our approach correctly predicts these examples as bear, car, and elephant.
  • Figure 2: Overview of our self-supervised single-view 3D reconstruction approach. Given an input image, $I$, we first extract a representation of the image using an encoder network, $h = f(I)$. Then, using a decoder network, $\Phi$, we generate triplane features [chan2022efficient3dgp]. Using volume rendering [mildenhall2021nerf], conditioned on a fixed camera location, we reconstruct the input image, $I_{rec}$, and its depth, $D_{rec}$. We optimize all networks using a combination of reconstruction losses on the input image, $\mathcal{L}_{rgb}$, and estimated depth, $\mathcal{L}_{depth}$, along with a distillation loss, $\mathcal{L}_{dist}$, from a frozen 2D self-supervised learning model to prevent the forgetting of already learned informative representations.
  • Figure 3: Comparison of top-5 predictions from linear classifiers trained on the original DINOv2 [oquab2023dinov2] backbone features (shown in red) and on our 3D-enhanced approach (shown in blue), on various challenging examples from ImageNet-Rendition [hendrycks2021many] and ImageNet-Sketch [wang2019learning]. Our method results in more shape information being encoded in the representation and hence leads to classifiers that are more robust on these challenging out-of-distribution examples.
  • Figure 4: Quantification of the shape bias of different DINOv2 [oquab2023dinov2] representations with and without our 3D-Prior method. We calculate the shape bias using the data and protocol from [geirhos2018imagenet]. Our approach increases the shape bias of visual recognition models, and the difference grows with larger backbones.
  • Figure 5: We compare the performance of our approach using different amounts of training data from the same source for the 3D proxy task with DINOv2 ViT-B/14 backbones. Surprisingly, more data does not change the performance drastically, which shows that our method is data efficient.
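
The shape-bias measurement referred to in Figure 4 can be sketched as follows. On cue-conflict images, where the shape of an object belongs to one class and its texture to another, shape bias is the fraction of shape-or-texture decisions that agree with the shape label. The function name `shape_bias` and the toy labels, borrowed from the Figure 1 caption, are illustrative assumptions; this is not the evaluation code behind the reported numbers.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape-or-texture decisions that follow the shape label.

    Predictions that match neither cue are ignored, as are images
    whose shape and texture labels coincide (no conflict to resolve).
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # not a cue-conflict image
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched a shape or texture label")
    return shape_hits / decided


if __name__ == "__main__":
    # Toy labels taken from the Figure 1 description.
    shapes = ["bear", "car", "elephant"]
    textures = ["dog", "chair", "knife"]
    print(shape_bias(["dog", "chair", "knife"], shapes, textures))
    print(shape_bias(["bear", "car", "elephant"], shapes, textures))
```

A value near 1 means the classifier mostly follows shape and a value near 0 means it mostly follows texture.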