Enhancing 2D Representation Learning with a 3D Prior
Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan
TL;DR
This work addresses the brittleness of 2D self-supervised representations by injecting a strong 3D prior through a single-view proxy task. Starting from a pre-trained 2D SSL backbone, the method trains a triplane-based 3D decoder to reconstruct both RGB images and pseudo depth maps via volume rendering, with a distillation loss that preserves the existing 2D features. The approach yields consistent robustness gains across multiple datasets (e.g., ImageNet-Rendition, ImageNet-Sketch, PUG) and increases shape bias, with only modest or neutral impact on standard downstream tasks such as ImageNet classification and depth estimation. These results indicate that incorporating 3D structure from monocular imagery can produce more robust, shape-aware representations without requiring multi-view data or explicit 3D supervision, offering a data-efficient path to stronger visual representations.
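To make the training signal described above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard emission-absorption volume rendering formulation for compositing per-sample densities and colors along a ray into an RGB value and an expected depth, and it combines RGB reconstruction, pseudo-depth reconstruction, and feature distillation into one objective. The function names (`composite_ray`, `total_loss`), the choice of L2/L1 penalties, and the weights `w_depth` and `w_distill` are all hypothetical; in the actual method the densities and colors would come from the triplane-based decoder.

```python
import numpy as np

def composite_ray(densities, colors, t_vals):
    """Composite samples along one ray (standard emission-absorption model).

    densities: (N,) non-negative volume densities at each sample
    colors:    (N, 3) RGB values at each sample
    t_vals:    (N,) increasing sample distances along the ray (N >= 2)
    """
    # Spacing between adjacent samples; the last interval reuses the previous spacing.
    last_gap = t_vals[-1] - t_vals[-2]
    deltas = np.diff(t_vals, append=t_vals[-1] + last_gap)
    # Opacity contributed by each interval.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i without being absorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)   # rendered pixel color
    depth = (weights * t_vals).sum()                # expected ray termination depth
    return rgb, depth, weights

def total_loss(rgb_pred, rgb_gt, depth_pred, depth_pseudo,
               feat_student, feat_teacher, w_depth=1.0, w_distill=1.0):
    """Hypothetical combined objective; loss forms and weights are assumptions."""
    l_rgb = np.mean((rgb_pred - rgb_gt) ** 2)                # RGB reconstruction
    l_depth = np.mean(np.abs(depth_pred - depth_pseudo))     # pseudo-depth reconstruction
    l_distill = np.mean((feat_student - feat_teacher) ** 2)  # stay close to the frozen 2D features
    return l_rgb + w_depth * l_depth + w_distill * l_distill
```

In this sketch the distillation term plays the role the TL;DR assigns to it: it penalizes drift of the adapted features away from those of the original pre-trained backbone while the rendering terms inject 3D structure.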
Abstract
Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data, which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans, who obtain rich 3D information from their binocular vision and through motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy because shape-centric visual processing has been shown to be more robust than texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly building a strong 3D structural prior directly into the model during training. Through experiments across a range of datasets, we demonstrate that our 3D-aware representations are more robust than those of conventional self-supervised baselines.
