Not all Views are Created Equal: Analyzing Viewpoint Instabilities in Vision Foundation Models

Mateusz Michalkiewicz, Sheena Bai, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan

TL;DR

This work examines how vision foundation models respond to camera viewpoint changes, challenging the assumption that learned features encode scene content independently of viewpoint. It defines an instability metric ins_f(v_i) from feature-space distances averaged over neighboring viewpoints, and then classifies viewpoints as stable, accidental, or OOD without using image inputs. Evaluating nine diverse featurizers on ABO and CO3D, the paper shows that accidental viewpoints are encoded consistently across models, whereas OOD viewpoints vary with each model's biases, and that instability degrades zero-shot classification, linear probing, VQA, and monocular 3D reconstruction performance. The findings highlight the need for robustness measures, such as stability confidences and regularization, to produce reliable downstream outputs under diverse viewing conditions and to guide safer deployment of foundation-model features in 3D reasoning tasks.
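
To make the idea of a neighbor-averaged instability score concrete, here is a minimal sketch. The exact definition of ins_f(v_i) is not reproduced on this page, so everything below is an assumption for illustration: the function name `instability_scores`, the L2 normalization of features, the choice of Euclidean distance, and the parameter `k` (number of nearest camera neighbors) are all hypothetical, and the paper's own formula may differ.

```python
import numpy as np


def instability_scores(features, cameras, k=4):
    """Mean feature-space distance from each viewpoint to its k nearest
    viewpoints in camera space (illustrative assumption, not the paper's
    exact metric).

    features: (N, D) array, one feature vector per viewpoint.
    cameras:  (N, 3) array, one camera position per viewpoint.
    Returns an (N,) array; larger values suggest a less stable viewpoint.
    """
    features = np.asarray(features, dtype=float)
    cameras = np.asarray(cameras, dtype=float)
    n = features.shape[0]
    if cameras.shape[0] != n:
        raise ValueError("features and cameras must have the same length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of viewpoints")

    # Assumption: L2-normalize features so scores are comparable across models.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    features = features / np.clip(norms, 1e-12, None)

    # Pairwise camera-space distances; exclude each viewpoint from its own neighbors.
    cam_dist = np.linalg.norm(cameras[:, None, :] - cameras[None, :, :], axis=-1)
    np.fill_diagonal(cam_dist, np.inf)
    neighbors = np.argsort(cam_dist, axis=1)[:, :k]  # (N, k) neighbor indices

    # Feature-space distance to each neighbor, averaged per viewpoint.
    feat_dist = np.linalg.norm(features[:, None, :] - features[neighbors], axis=-1)
    return feat_dist.mean(axis=1)
```

Calling `instability_scores(features, cameras, k=2)` on features extracted from views rendered around an object returns one score per view. Note that the function needs only the feature vectors and camera positions, never the images, which mirrors the image-free setting described above.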

Abstract

In this paper, we analyze the viewpoint stability of foundation models (specifically, their sensitivity to changes in viewpoint) and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundation models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints, where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying out-of-distribution (OOD), accidental, and stable viewpoints using feature representations alone, without accessing the actual images. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of OOD viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks (classification, VQA, and 3D reconstruction), we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.

Paper Structure (16 sections, 1 equation, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Viewpoint stability of a featurizer. The blue cameras represent small perturbations in camera space that result in correspondingly small changes in the feature space of a featurization function, indicating a stable viewpoint. In contrast, the red cameras represent small perturbations in camera space that lead to large changes in feature space, signifying an unstable viewpoint.
  • Figure 2: Instability scores of CLIP (Radford et al., 2021) features with respect to viewpoint for three scenes in the ABO (Collins et al., 2022) dataset. (Left) A wooden side table with no instabilities. (Center) A painting with two accidental viewpoints, where the structure of the painting is hidden by the viewing angle. (Right) A high-back cushion with an uncommon (OOD) viewpoint.
  • Figure 3: Examples of accidental and OOD viewpoints using CLIP and DINO embeddings across the ABO and CO3D datasets. Accidental views obscure an object's true structure, while OOD views present uncommon orientations rarely or never seen during training. In the CO3D dataset, additional sources of instability include occlusions, image blur, and objects displayed upside down.
  • Figure 4: PCA visualization of stable (blue), unstable-accidental (green), and unstable-OOD (red) viewpoints using CLIP embeddings from the ABO dataset. We see that unstable viewpoints generally cluster in feature space, with accidental viewpoints tightly grouped together. Please refer to sample images from the unstable-accidental and unstable-OOD clusters in Fig. \ref{fig:cluster_sample_combined}.
  • Figure 5: Percentage of generated captions achieving an F1 score above 0.5 (F1@0.5; higher is better) for stable, OOD, and accidental viewpoints on the ABO dataset. The F1 score is calculated using BERTScore between generated captions and ground-truth (GT) captions, where GT captions are generated with access to the object's label information. Captions generated from stable viewpoints achieve higher F1 scores than those from accidental viewpoints, indicating that viewpoint instability affects caption generation accuracy.
  • ...and 3 more figures
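
The F1@0.5 quantity in the Figure 5 caption is a simple thresholded percentage, which the sketch below illustrates. It assumes the per-caption F1 scores have already been computed elsewhere (for example with a BERTScore implementation); the function name `f1_at_threshold`, the group labels, and the example numbers in the usage note are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def f1_at_threshold(f1_scores, groups, threshold=0.5):
    """Percentage of captions per viewpoint group whose F1 exceeds `threshold`.

    f1_scores: iterable of precomputed per-caption F1 values in [0, 1].
    groups:    iterable of group labels (e.g. "stable", "ood", "accidental"),
               aligned one-to-one with f1_scores.
    Returns a dict mapping each group label to a percentage in [0, 100].
    """
    f1_scores = list(f1_scores)
    groups = list(groups)
    if len(f1_scores) != len(groups):
        raise ValueError("f1_scores and groups must have the same length")

    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, group in zip(f1_scores, groups):
        totals[group] += 1
        if score > threshold:  # strictly above the threshold, as in "above 0.5"
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}
```

For instance, with made-up scores `[0.9, 0.6, 0.4, 0.3]` and groups `["stable", "stable", "stable", "accidental"]`, two of the three stable captions and none of the accidental captions clear the threshold.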