Understanding Multi-View Transformers
Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann
TL;DR
This work tackles the interpretability gap in feed-forward multi-view transformers for 3D reconstruction. It proposes a probing framework that attaches probes to the skip connections of a DUSt3R-like model and regresses pointmaps from per-patch features. Visualizing these pointmaps across decoder layers reveals how latent 3D geometry and correspondences are built up layer by layer, and whether global pose information is required. The findings show that self-attention mainly drives geometric refinement of the second view, while cross-attention discovers correspondences and refines them from semantic to geometric matches, so that they become more accurate as decoding progresses. The approach provides mechanistic insight into how multi-view transformers represent spatial geometry and offers a practical methodology for analyzing such models, with potential impact on safety- and reliability-critical 3D tasks.
Abstract
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, unlike previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and on the role of the individual layers, and suggesting how it differs from methods with a stronger inductive bias toward an explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand.
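To make the probing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-patch features from each decoder layer have already been extracted from a frozen model, and it swaps in a ridge-regularized linear probe fitted in closed form; the abstract does not state the probe architecture or training setup. All function names, the synthetic features, and the noise levels are hypothetical. The error is measured on the same patches the probe was fitted on, whereas a real analysis would use held-out data.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, pointmap, ridge=1e-3):
    """Fit a ridge-regularized linear map from per-patch features to 3D points.

    features: (num_patches, dim) activations read from one layer (frozen model).
    pointmap: (num_patches, 3) target 3D point for each patch.
    Returns a (dim + 1, 3) weight matrix (assumption: a linear probe).
    """
    x = _with_bias(features)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ pointmap)


def apply_probe(weights, features):
    """Regress a pointmap from features with a fitted probe."""
    return _with_bias(features) @ weights


def probe_error_per_layer(layer_features, pointmap):
    """Mean Euclidean error of a separately fitted probe at every layer."""
    errors = []
    for features in layer_features:
        weights = fit_linear_probe(features, pointmap)
        predicted = apply_probe(weights, features)
        errors.append(float(np.mean(np.linalg.norm(predicted - pointmap, axis=1))))
    return errors


# Hypothetical stand-in for real activations: the 3D points are linearly
# embedded in 16-dimensional features, with less noise at "deeper" layers.
rng = np.random.default_rng(0)
points = rng.normal(size=(256, 3))
embedding = rng.normal(size=(3, 16))
layers = [
    points @ embedding + noise * rng.normal(size=(256, 16))
    for noise in (2.0, 0.5, 0.0)
]
errors = probe_error_per_layer(layers, points)
print(errors)
```

In this toy setup the probe error shrinks as the features carry the geometry more cleanly, which is the kind of layer-wise trend that probing the residual stream is meant to expose. With a real model, the per-layer features would typically be collected with forward hooks on the decoder blocks, and the decoded pointmaps would be visualized rather than only scored.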
