Multi-View Foundation Models

Leo Segre, Or Hirschorn, Shai Avidan

TL;DR

The paper tackles the lack of 3D awareness in 2D foundation models by introducing Multi-View Foundation Models that enforce cross-view feature consistency. It does so with Multi-View Adapters and Plücker-based pose conditioning, which enable geometry-aware reasoning without per-scene optimization and apply to multiple backbones (DINOv2, DINOv3, CLIP, SAM). A geometry-aware dense loss, combined with regularization, aligns features across views while preserving the semantic priors of the base model. Experiments on ScanNet++ and on generalization sets show improved geometric consistency and robust downstream performance on surface normal estimation and cross-view segmentation. Because consistency is obtained at inference time from 2D priors, the approach benefits 3D perception tasks without expensive scene-level optimization.
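To make the "Plücker-based pose conditioning" mentioned above more concrete, the following is a minimal sketch of how a per-pixel Plücker ray map can be computed from a pinhole camera. It relies on the standard definition of Plücker coordinates for a ray with origin $o$ and unit direction $d$, namely $(d, o \times d)$. The function name, the argument names `K` and `cam_to_world`, the pixel-centre convention, and the ordering of the six channels are assumptions made for illustration; this summary does not specify how the paper builds its embedding.

```python
import numpy as np

def plucker_ray_embedding(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d), one ray per pixel.

    K            : (3, 3) pinhole intrinsics (assumed convention).
    cam_to_world : (4, 4) camera-to-world pose (assumed convention).
    """
    # Pixel centres in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment vector o x d; it is independent of which point on the ray is used.
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
    pose = np.eye(4)
    pose[:3, 3] = [1.0, 2.0, 3.0]
    rays = plucker_ray_embedding(K, pose, height=48, width=64)
    print(rays.shape)
```

A ray map of this kind depends only on the camera, not on image content, which is why it can serve as a pose signal for attention layers that mix tokens from different views.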

Abstract

Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, when we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformer-based foundation models (e.g., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
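The notion of "consistent features for the same 3D point" can be illustrated with a small feature-matching sketch: take the feature at a query pixel in one view, compare it by cosine similarity with every feature in a second view, and pick the best match. This is only an illustration under stated assumptions; the function name `match_by_cosine`, the dense `(H, W, C)` feature layout, and the nearest-neighbour rule are hypothetical choices, not the evaluation protocol of the paper.

```python
import numpy as np

def match_by_cosine(feat_a, feat_b, query_yx):
    """Locate the feature in view B most similar to the query feature in view A.

    feat_a, feat_b : (H, W, C) dense feature maps (assumed layout).
    query_yx       : (row, col) of the query location in view A.
    Returns the best-matching (row, col) in view B and the (H, W) similarity map.
    """
    eps = 1e-8
    query = feat_a[query_yx[0], query_yx[1]]
    query = query / (np.linalg.norm(query) + eps)

    # Normalise every feature of view B so the dot product is a cosine similarity.
    feat_b_unit = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    similarity = feat_b_unit @ query

    row, col = np.unravel_index(np.argmax(similarity), similarity.shape)
    return (int(row), int(col)), similarity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(8, 8, 16))
    # Toy "second view": the same features shifted by (2, 3) pixels.
    view_b = np.roll(view_a, shift=(2, 3), axis=(0, 1))
    match, sim = match_by_cosine(view_a, view_b, (1, 1))
    print(match, sim.shape)
```

With perfectly consistent features the match lands on the projection of the same 3D point in the second view; the farther it lands from that projection, the less consistent the features are.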

Paper Structure

This paper contains 32 sections, 10 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Multi-View Foundation Models. We show how to adapt existing foundation models (e.g., DINO, SAM, CLIP) into inference-time multi-view consistent variants. Left: Top row: From a keypoint in one view (blue), our features yield geometrically accurate correspondence (green), while DINOv2 drifts (red). Bottom row: Cosine-similarity maps (with respect to the keypoint) show strong multi-view spatial consistency. Right: Location error vs. viewpoint angle on ScanNet++ test split. Our Multi-View DINOv2 maintains low error under large viewpoint changes, outperforming DINOv2 and FiT3D, whose accuracy degrades with increasing viewpoint difference.
  • Figure 2: Framework Overview. Our method integrates a pre-trained foundation model with multi-view spatial adapters (MV-Adapters) denoted in purple for multi-view consistent feature learning. These MV-Adapters are added after each Transformer block. Given $M$ input images with camera poses, per-view features are extracted and fused via 3D-aware adapter blocks conditioned on ray-based pose embeddings, producing geometry-consistent representations across views.
  • Figure 3: Feature Consistency Across Views. Numbered markers indicate query points in the first image, with dashed lines connecting to the most similar features in other views. MV-DINOv2 maintains geometric consistency with correspondences converging to the same 3D locations, while base DINOv2 exhibits geometric drift across viewpoints.
  • Figure 4: Feature Similarity to Base Model. For each input image (left), we visualize the feature embeddings of MV-DINOv2 (ours), DINOv2, and FiT3D after projecting all features into a shared PCA space. The strong visual alignment between MV-DINOv2 and the original DINOv2 shows that our multi-view adaptation preserves the semantic structure of the pretrained model, while FiT3D drifts further away. This confirms that our method achieves 3D consistency without sacrificing fidelity to the base features.
  • Figure 5: 3D Structure Embedding. (left) PCA visualization of our multi-view consistent DINOv2 (MV-DINOv2) features across scenes, showing clear semantic patterns. (right) PCA visualization of the difference between our MV-DINOv2 features and the base model’s features. The visualization reveals a clear 3D positional pattern that indicates the model encodes geometric information while maintaining semantic consistency with the base model.
  • ...and 9 more figures
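
The caption of Figure 4 describes projecting the features of several models into a shared PCA space. A minimal sketch of that kind of visualization follows: one PCA basis is fitted on the features of all maps together, so that equal colors mean equal feature directions across maps. The function name `shared_pca_rgb`, the use of three components, and the joint min-max scaling to $[0, 1]$ are assumptions for illustration; the captions do not state how the paper's figures were rendered.

```python
import numpy as np

def shared_pca_rgb(feature_maps, n_components=3):
    """Project several (H, W, C) feature maps onto one jointly fitted PCA basis.

    Returns one (H, W, n_components) array per input map, scaled to [0, 1]
    with a single min/max shared by all maps so colors stay comparable.
    """
    flat = [f.reshape(-1, f.shape[-1]) for f in feature_maps]
    stacked = np.concatenate(flat, axis=0)
    mean = stacked.mean(axis=0)

    # Principal directions are the right singular vectors of the centred data.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    basis = vt[:n_components].T

    projected = [(f - mean) @ basis for f in flat]
    joint = np.concatenate(projected, axis=0)
    low, high = joint.min(axis=0), joint.max(axis=0)

    return [
        ((p - low) / (high - low + 1e-8)).reshape(f.shape[0], f.shape[1], n_components)
        for p, f in zip(projected, feature_maps)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 5, 8))
    adapted = base + 0.1 * rng.normal(size=(4, 5, 8))
    images = shared_pca_rgb([base, adapted])
    print([img.shape for img in images])
```

Fitting the basis jointly, rather than once per model, is what makes the side-by-side comparison meaningful: a separate PCA per map would assign colors independently and hide how close two feature spaces are.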