Understanding Multi-View Transformers

Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann

TL;DR

This work tackles the interpretability gap in feed-forward multi-view transformers for 3D reconstruction by proposing a probing framework that attaches probes to the skip connections of a DUSt3R-like model to regress pointmaps from per-patch features. By visualizing these pointmaps across decoder layers, the study reveals how latent 3D geometry and correspondences are built up layer by layer, and whether global pose information is required. The findings show that self-attention mainly drives geometric refinement of the second view, while cross-attention discovers correspondences and refines them from semantic to geometric, so that they become more accurate as decoding progresses. This approach provides mechanistic insights into spatial geometry in multi-view transformers and offers a practical methodology for analyzing and understanding such models, with potential impact on safety- and reliability-critical 3D tasks.

Abstract

Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, in contrast to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggesting how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand.
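
As a rough illustration of what probing a residual connection can look like, the sketch below fits a probe that maps one patch token to the 16x16 patch of 3D points it covers. It is not the authors' implementation: the ridge-regularized linear map, the names `fit_linear_probe` and `apply_probe`, and the token dimension are assumptions made for illustration, and the probe architecture used in the paper may differ. The analysis code linked above is the authoritative reference.

```python
import numpy as np

PATCH = 16                    # patch side in pixels (one patch is 16 pixels wide)
OUT_DIM = PATCH * PATCH * 3   # one XYZ point per pixel of the patch


def fit_linear_probe(tokens, target_patches, ridge=1e-3):
    """Fit a ridge-regularized linear probe (an assumed stand-in architecture).

    tokens:         (N, D) patch tokens read from a skip connection.
    target_patches: (N, PATCH, PATCH, 3) pointmap patches for those tokens.
    Returns weights of shape (D + 1, OUT_DIM); the last row is the bias.
    """
    n, d = tokens.shape
    x = np.hstack([tokens, np.ones((n, 1))])        # append a bias column
    y = target_patches.reshape(n, OUT_DIM)          # flatten each patch
    gram = x.T @ x + ridge * np.eye(d + 1)          # regularized normal equations
    return np.linalg.solve(gram, x.T @ y)


def apply_probe(weights, tokens):
    """Regress one pointmap patch per token with previously fitted weights."""
    x = np.hstack([tokens, np.ones((tokens.shape[0], 1))])
    return (x @ weights).reshape(-1, PATCH, PATCH, 3)
```

Fitting one such probe per decoder layer, with the model's weights left untouched, would give one pointmap per layer that can be visualized and compared. A linear probe keeps the readout simple, so what it recovers is more plausibly present in the tokens than computed by the probe itself.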

Paper Structure

This paper contains 38 sections, 12 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Overview of our interpretability approach. We probe the skip connections of DUSt3R's multi-view transformer decoder for pointmaps and visualize the gradual development of DUSt3R's latent 3D geometry, correspondence, and camera pose estimate.
  • Figure 2: Probing mechanism. We train our probe to regress the patch of a pointmap from the corresponding vision transformer patch token.
  • Figure 3: Visualization of the pointmaps probed after the skip connections across the decoder. We show probes after self-attention (SA) layers at different blocks for two examples: hydrant (top) and chairs (bottom). The first view is tinted green and the second view red. The red-tinted patches move through 3D space toward their correct locations in a semi-rigid fashion.
  • Figure 4: Layer-wise analysis of geometric refinement in the second view. The poses of the predicted and ground-truth pointmaps are first aligned using a Procrustes solve. We then measure a scale- and shift-invariant pointmap error to track the second view's geometry refinement across the network. Most of the error reduction in the decoder is achieved by self-attention layers.
  • Figure 5: Evolution of the percentage of valid correspondences across the decoder blocks. Correspondences are considered valid if their 2D error is less than 16 pixels (1 patch). An improving trend is observed over the first six blocks.
  • ...and 10 more figures
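
The measurements named in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code. The Procrustes solve is taken to be a rigid Kabsch alignment, and the scale- and shift-invariant error is taken to be the mean point distance after centering and a least-squares scale fit. Both choices are assumptions, and all function names are hypothetical. The only value taken from the captions is the 16-pixel validity threshold.

```python
import numpy as np


def kabsch_align(pred, gt):
    """Rigidly align pred (N, 3) to gt (N, 3) with rotation and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, _, vt = np.linalg.svd(p.T @ g)               # SVD of the 3x3 covariance
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T       # proper rotation, no reflection
    return p @ rot.T + mu_g


def scale_shift_invariant_error(pred, gt):
    """Mean point distance after removing the best shift and scalar scale."""
    p, g = pred - pred.mean(axis=0), gt - gt.mean(axis=0)
    scale = (p * g).sum() / (p * p).sum()           # least-squares scale
    return float(np.linalg.norm(scale * p - g, axis=1).mean())


def valid_correspondence_percentage(pred_px, gt_px, threshold=16.0):
    """Percentage of matches whose 2D error is below threshold pixels."""
    err = np.linalg.norm(pred_px - gt_px, axis=1)
    return float(100.0 * (err < threshold).mean())
```

Evaluating `scale_shift_invariant_error(kabsch_align(pred, gt), gt)` on the second view's probed pointmap at each layer would produce a curve of the kind Figure 4 describes, while `valid_correspondence_percentage` applied per block corresponds to the quantity plotted in Figure 5.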