A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Alon Kaya, Igal Bilik, Inna Stainvas

TL;DR

The study analyzes how Vision Transformers (ViTs) and CNNs perform as backbones for geometric regression tasks under few-shot conditions, specifically 2D rigid transformation estimation and fundamental matrix estimation. It introduces a unified encoder–regression pipeline with location-aware pooling, evaluated on a small ImageNet-derived dataset and two stereo datasets (KITTI and FlyingThings3D), including cross-domain scenarios. Key findings show that ViTs outperform CNNs when ample data is available, whereas the inductive bias of CNNs confers advantages in data-scarce regimes. Smaller ViT patch sizes and self-supervised models (DINO) offer robust local representations, and CNNs pretrained on large image–text datasets (CLIP) also show strong few-shot transfer. The results highlight the value of balancing local and global representations, and point toward hybrid CNN–ViT architectures and dynamic F-estimation for moving-camera settings as promising directions for improving geometric regression under limited data.

Abstract

Vision transformers (ViTs) and large-scale convolutional neural networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, an important problem in applications such as autonomous mobility, robotics, and 3D scene reconstruction. To address this question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) across various data-size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, which encourages them to focus mostly on high-level semantics. The considered tasks require a different balance of local and global features, which challenges the straightforward adoption of these models as backbones. Empirical comparative analysis shows that, as in training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small-data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of a ViT. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation, where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement, motivating future research towards hybrid architectures that balance local and global representations.
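
To make the two regression targets concrete, the following minimal sketch builds a 2D rigid transformation from a rotation angle and a translation, checks the epipolar constraint for a fundamental matrix, and projects an arbitrary 3×3 matrix onto rank 2 (a fundamental matrix always has rank 2). The function names (`rigid_matrix`, `epipolar_residual`, `project_to_rank2`), the numeric values, and the SVD-based projection are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def rigid_matrix(theta_deg, tx, ty):
    """3x3 homogeneous matrix: 2D rotation (degrees) followed by a translation."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def epipolar_residual(F, x_left, x_right):
    """Algebraic residual x_right^T F x_left for homogeneous pixel coordinates."""
    return float(x_right @ F @ x_left)

def project_to_rank2(F):
    """Closest rank-2 matrix in Frobenius norm: zero the smallest singular value."""
    U, S, Vt = np.linalg.svd(F)
    S[-1] = 0.0
    return U @ np.diag(S) @ Vt

# Task 1: three parameters (angle, tx, ty) define the rigid transformation.
T = rigid_matrix(90.0, 5.0, -2.0)          # hypothetical parameter values
moved = T @ np.array([1.0, 0.0, 1.0])      # rotate the point (1, 0), then shift it

# Task 2: for a rectified stereo pair, F is (up to scale) the matrix below,
# and matching points share the same image row, so the residual vanishes.
F_rect = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0]])
res = epipolar_residual(F_rect,
                        np.array([120.0, 40.0, 1.0]),   # hypothetical left pixel
                        np.array([100.0, 40.0, 1.0]))   # hypothetical right pixel

# A raw 3x3 regression output is generally full rank; the projection fixes that.
rng = np.random.default_rng(0)
F_fixed = project_to_rank2(rng.normal(size=(3, 3)))
```

The rigid transformation has only three degrees of freedom, whereas the fundamental matrix is a 3×3 matrix defined up to scale with a rank constraint, which is why the second task needs an explicit mechanism to keep predictions on the set of valid matrices.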

Paper Structure

This paper contains 35 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Schematic representation of the proposed network for comparing ViTs and CNNs used as backbone models. The network contains encoder and regression modules. The ViT outputs patch embeddings and an unused CLS token for each frame. The CNN outputs aggregated feature maps. The backbone outputs of both frames are reshaped to a data structure of size $(D, \sqrt{N}, \sqrt{N})$, where $D$ is the embedding dimension and $N$ is the number of spatial patches for ViTs or the number of pixels in the feature maps for CNNs. These embeddings are concatenated along the depth dimension and passed through two convolutional layers, followed by location-aware max-pooling. The regression module for the $\mathbf{F}$-matrix estimation task consists of a rank-constraint layer, and the regression module for rigid estimation outputs the $3$ transformation elements, each using a different normalization method.
  • Figure 2: Rigid transformation estimation performance. (a) Euclidean translation error in pixels. (b) MAE rotation error in degrees. Solid lines represent ViT-based models and dashed lines represent CNN-based models. For visualization, are divided by a factor of $3$.
  • Figure 3: SED (a) and AD (b) metrics for $\mathbf{F}$-matrix estimation using ViT and CNN models with various training data sizes on the KITTI dataset. For visualization, are divided by $3$.
  • Figure 4: SED results for the fundamental matrix estimation task on the KITTI dataset, comparing the largest (a) and smallest (b) training data sizes. Bar plots represent SED performance, and line plots indicate the corresponding model parameter counts. In the small-data regime (b), CLIP-ResNet-101 achieves the lowest SED, benefiting from extensive pretraining on image–text alignment using large-scale datasets and from a lower parameter count, which reduces overfitting. Among transformer-based models, DINO-ViT-B/16 slightly outperforms CLIP-ViT-B/16. In the large-data regime (a), DINO-ViT-B/16 achieves the best performance, outperforming all other models, including CLIP-ResNet variants. This highlights the advantage of its self-supervised learning (SSL) objective, particularly in tasks requiring fine-grained geometric reasoning. Transformers with a patch size of 16 consistently perform well, likely due to better access to local spatial information. Overall, DINO-ViT-B/16 emerges as the most stable model across varying data scales. The experiments also show that CLIP-ResNet-101 and CLIP-ResNet-50x4 perform best among CNN-based models, which is attributable to their training on large-scale WebImageText datasets using image–text alignment tasks.
  • Figure 5: SED (a) and AD (b) for $\mathbf{F}$-matrix estimation using CLIP-ViT-B/32 with the bottom layers frozen. For visualization, are divided by $3$.
  • ...and 3 more figures
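
The data flow described in the Figure 1 caption can be sketched at the level of array shapes. The snippet below drops the CLS token, reshapes the patch embeddings of each frame to $(D, \sqrt{N}, \sqrt{N})$, and concatenates the two frames along the depth dimension. The two convolutional layers are omitted, and the pooling function is only one possible reading of "location-aware" max-pooling (the per-channel maximum together with the normalized position of that maximum); this reading, the function names, and the sizes are assumptions and are not confirmed by the caption.

```python
import numpy as np

def tokens_to_grid(tokens):
    """(N + 1, D) ViT tokens with CLS first -> (D, sqrt(N), sqrt(N)) grid."""
    patches = tokens[1:]                       # discard the unused CLS token
    n, d = patches.shape
    side = int(round(np.sqrt(n)))
    assert side * side == n, "N must be a perfect square"
    return patches.T.reshape(d, side, side)    # row-major patch order is kept

def max_pool_with_location(x):
    """ASSUMED pooling: per-channel max plus normalized (row, col) of the argmax."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    idx = flat.argmax(axis=1)                  # flat position of each channel's max
    rows, cols = np.divmod(idx, w)
    vals = flat[np.arange(c), idx]
    return np.stack([vals, rows / (h - 1), cols / (w - 1)], axis=1)  # (C, 3)

rng = np.random.default_rng(0)
D, N = 8, 49                                   # hypothetical sizes: 7 x 7 patches
frame_a = tokens_to_grid(rng.normal(size=(N + 1, D)))
frame_b = tokens_to_grid(rng.normal(size=(N + 1, D)))

# Concatenate both frames along the depth (channel) dimension: (2D, 7, 7).
stacked = np.concatenate([frame_a, frame_b], axis=0)

# Convolutional layers would be applied here; they are skipped in this sketch.
pooled = max_pool_with_location(stacked)       # (2D, 3) vector fed to regression
```

For a CNN backbone the reshape step is unnecessary, since the feature maps already have the layout $(D, H, W)$; only the concatenation and the subsequent layers apply.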