A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

TL;DR

The paper proposes the Omnivorous Vision Encoder, a framework that learns a modality-agnostic feature space, enabling robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
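
To make the dual objective concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the abstract does not give the exact formulation, so this sketch assumes a simple cosine-distance alignment term between student embeddings of paired RGB and depth inputs, plus a cosine-distance anchoring term between the student's RGB embedding and the frozen teacher's RGB embedding. The function names, the choice to anchor on RGB only, and the weighting parameter `lambda_anchor` (named after the $\lambda_{anchor}$ mentioned in the Figure 4 caption) are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dual_objective(student_rgb, student_depth, teacher_rgb, lambda_anchor=1.0):
    """Hypothetical dual loss over a batch of embeddings (lists of vectors).

    Assumed form (not taken from the paper):
      align  = mean(1 - cos(student_rgb_i, student_depth_i))
      anchor = mean(1 - cos(student_rgb_i, teacher_rgb_i))
      total  = align + lambda_anchor * anchor
    """
    n = len(student_rgb)
    # Alignment term: pull embeddings of the same scene together across modalities.
    align = sum(
        1.0 - cosine_similarity(r, d) for r, d in zip(student_rgb, student_depth)
    ) / n
    # Anchoring term: keep the student close to the frozen teacher's features.
    anchor = sum(
        1.0 - cosine_similarity(r, t) for r, t in zip(student_rgb, teacher_rgb)
    ) / n
    return align + lambda_anchor * anchor, align, anchor
```

In this sketch, setting `lambda_anchor` to zero leaves only the alignment term, which the embeddings could satisfy by collapsing to a single point; the anchoring term is what would counteract that, consistent with the trade-off described in the Figure 4 caption.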

Paper Structure

This paper contains 49 sections, 6 equations, 9 figures, 12 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Off-the-shelf vision encoders like DINO show poor cross-modal alignment. We show the similarity in feature space between randomly paired RGB images (top), between RGB images and depth maps of the same scene (middle), and between RGB and grayscale images of the same scene (bottom). While the numbers vary depending on the dataset, the pattern of misalignment between visual modalities remains consistent. Our proposed adapter aligns these modalities in an existing feature space.
  • Figure 2: Omnivorous Vision Encoder architecture. A frozen encoder $f^*$ extracts features $z_m = f^*(x_m)$ from a spectrum of modalities denoted $m$ (Segmentation, RGB, Depth). A trainable modality-agnostic adapter $g$ maps these features into a common, aligned embedding space, producing a modality-invariant representation $h = g(z_m)$. A convenient implementation of this architecture uses the early layers of a pretrained network as the frozen part $f^*$, and the later layers as the adapter $g$.
  • Figure 3: Training data: depth and segmentation maps are first colorized using a natural color palette derived from the corresponding RGB image. We then apply a data augmentation: we blend the colorized depth image with up to 50% of the RGB image (and likewise for the segmentation image). The compositing alpha is randomly sampled (between 0% and 50%) for each datapoint. The idea is to interpolate between the modalities (Depth $\leftrightarrow$ RGB $\leftrightarrow$ Seg) smoothly and teach the model a degree of invariance across the full spectrum, while also providing more negative examples for between-scene contrastive learning. Other potential benefits: the augmentation (i) makes our representations naturally invariant to scene lighting; and (ii) helps us cope with imperfect depth and segmentation values.
  • Figure 4: Analysis of the anchoring loss. (a) Trade-off between cross-modal alignment and cross-scene discernibility, controlled by $\lambda_{anchor}$. The x-axis measures alignment (cosine similarity of $\langle$RGB, Depth$\rangle$ pairs) and the y-axis measures discernibility (1 - cosine similarity of distinct RGB scenes) on ScanNet. Frozen DINOv2 (light blue) is discriminative but poorly aligned. (b) To pick a value for $\lambda_{anchor}$, we examine its effect on linear-head prediction performance, from Omnivorous features of RGB images, on Depth (NYUv2) and Segmentation (Cityscapes). We omit the datapoint for $\lambda_{anchor} = 0$, located at $(x=0.732, y=0.356)$, for clarity, as it lies too far below the remaining datapoints.
  • Figure 5: PCA visualizations of frozen (DINO ViT-B/14) and adapted (Omnivorous ViT-B/14) features on two scenes.
  • ...and 4 more figures
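
The blending augmentation described in the Figure 3 caption can be sketched as a simple alpha composite. This is a rough illustration under stated assumptions, not the authors' implementation: images are represented as nested lists of per-pixel channel tuples with values in [0, 1], the alpha is drawn uniformly (the caption says only that it is randomly sampled between 0% and 50%), and the function names `blend_with_rgb` and `sample_blend` are hypothetical.

```python
import random


def blend_with_rgb(colorized, rgb, alpha):
    """Composite a colorized depth/segmentation image with its RGB image.

    Each output channel is (1 - alpha) * colorized + alpha * rgb.
    Images are lists of rows; each row is a list of channel tuples.
    """
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha must lie in [0, 0.5]")
    return [
        [
            tuple((1.0 - alpha) * c + alpha * r for c, r in zip(c_px, r_px))
            for c_px, r_px in zip(c_row, r_row)
        ]
        for c_row, r_row in zip(colorized, rgb)
    ]


def sample_blend(colorized, rgb, rng=random):
    """Draw one alpha per datapoint (uniform sampling is an assumption)."""
    alpha = rng.uniform(0.0, 0.5)
    return blend_with_rgb(colorized, rgb, alpha), alpha
```

With `alpha = 0` the output is the pure colorized map, and with `alpha = 0.5` it is an even mix of the two images, so sampled values interpolate part of the way from the depth or segmentation rendering toward RGB.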