Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas

TL;DR

The paper demonstrates that large vision transformers exhibit partial 3D view equivariance and that higher 3D consistency correlates with better pose estimation, tracking, and semantic correspondence. It then shows that a simple finetuning strategy, using multiview correspondences and the SmoothAP loss, substantially improves 3D correspondence understanding with minimal data and a lightweight head. The approach yields notable gains across downstream tasks and generalizes from synthetic to real imagery, with additional benefits observed on in-the-wild 3D tasks. Overall, the work provides a practical pathway to enhance the 3D capabilities of 2D ViTs while keeping training requirements modest.

Abstract

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, finetuning on a single object for one iteration results in substantial gains. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
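
To make the finetuning objective more concrete, the following is a minimal sketch of a Smooth-AP style ranking loss over cross-view feature matches, the loss family named in the TL;DR. It is not the authors' implementation (see the repository linked above for that). It assumes one query feature from view 1, a list of candidate features from view 2, and boolean labels marking which candidates are true 3D correspondences. The function names (`smooth_ap_loss`, `cosine`, `sigmoid`) and the temperature value `tau` are hypothetical choices for illustration.

```python
import math


def sigmoid(x, tau):
    """Temperature-scaled logistic function; a smooth stand-in for the step function."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return e / (1.0 + e)


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def smooth_ap_loss(query, candidates, is_positive, tau=0.01):
    """Return 1 - smoothed average precision for one query (assumed setup).

    query:       feature of a pixel/patch in view 1
    candidates:  features of pixels/patches in view 2
    is_positive: True where the candidate is a ground-truth 3D correspondence
    """
    sims = [cosine(query, c) for c in candidates]
    positives = [i for i, flag in enumerate(is_positive) if flag]
    if not positives:
        raise ValueError("at least one positive candidate is required")

    ap = 0.0
    for i in positives:
        # Smoothed rank of candidate i among the positives only.
        rank_pos = 1.0 + sum(
            sigmoid(sims[j] - sims[i], tau) for j in positives if j != i
        )
        # Smoothed rank of candidate i among all candidates.
        rank_all = 1.0 + sum(
            sigmoid(sims[j] - sims[i], tau) for j in range(len(sims)) if j != i
        )
        ap += rank_pos / rank_all
    return 1.0 - ap / len(positives)


if __name__ == "__main__":
    q = [1.0, 0.0]
    cands = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(smooth_ap_loss(q, cands, [True, False, False]))  # correct match ranked first
    print(smooth_ap_loss(q, cands, [False, False, True]))  # correct match ranked last
```

The loss approaches zero when every true correspondence is more similar to the query than every non-correspondence, and it grows as positives fall down the ranking. In an actual finetuning run the same expression would typically be written in an automatic-differentiation framework so that gradients reach the feature head; the pure-Python version here only illustrates the arithmetic.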

Paper Structure

This paper contains 51 sections, 24 figures, 17 tables.

Figures (24)

  • Figure 1: Improving 3D correspondence understanding through finetuning on feature equivariance. Left: Finetuning feature equivariance on one synthetic object can already enhance the vision transformer's ability to generate better 3D feature correspondences on general objects. Right: This improvement further leads to superior performance across multiple 3D tasks, including pose estimation, video tracking, and semantic correspondence.
  • Figure 2: Feature visualizations of different models. The sample image is rendered from Objaverse. Colors are computed from the high-dimensional features using PCA. We can see that MAE struggles to distinguish different parts of the content (e.g., similar features between head and body). Both CLIP and DeiT produce inconsistent features for the chest region between View 1 and View 2. DINOv2 gives the best correspondence.
  • Figure 3: Correlation between multiview feature equivariance and the task performances. Along the horizontal axis, lower APE indicates better feature equivariance, while the vertical axis reflects higher task performance across all four plots. The data points align roughly along the diagonal from the top left to the bottom right, suggesting a strong correlation between improved feature equivariance and better task performance.
  • Figure 4: Illustration of different types of correspondence tasks evaluated in our work.
  • Figure 5: Generalization from synthetic images (Objaverse) to real images (MVImgNet). Left: Data points lying roughly along the diagonal from the bottom left to the upper right indicate the correlation between the APE tested on the two datasets. The * next to the model name means it is finetuned. All finetuning is done on Objaverse with only synthetic data. Right: Finetuned on Objaverse, the feature equivariance of the model (measured in PCDP) improves on MVImgNet.
  • ...and 19 more figures