A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Alon Kaya; Igal Bilik; Inna Stainvas

TL;DR

The study analyzes how Vision Transformers (ViTs) and CNNs perform as backbones for geometric regression tasks under few-shot conditions, specifically 2D rigid transformation estimation and fundamental matrix estimation. It introduces a unified encoder–regression pipeline with location-aware pooling, evaluated on a small ImageNet-derived dataset and two stereo datasets (KITTI and FlyingThings3D), including cross-domain scenarios. The key findings are that ViTs outperform CNNs when ample data is available, whereas the inductive bias of CNNs confers advantages in data-scarce regimes; that smaller ViT patch sizes and self-supervised models (DINO) offer robust local representations; and that CNNs pretrained on large image–text datasets (CLIP) also show strong few-shot transfer. The results highlight the value of balancing local and global representations, and point to hybrid CNN–ViT architectures and dynamic F-estimation for moving-camera settings as promising directions for improving geometric regression under limited data.

Abstract

Vision transformers (ViTs) and large-scale convolutional neural networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, a problem that is important in applications such as autonomous mobility, robotics, and 3D scene reconstruction. To address this question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) across a range of data-size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, which encourages them to focus mostly on high-level semantics. The considered tasks require a different balance of local and global features, which complicates the straightforward adoption of these models as backbones. Empirical comparative analysis shows that, as when training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small-data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of a ViT. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation, where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement and motivate future research on hybrid architectures that balance local and global representations.
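For readers unfamiliar with the two regression targets named in the abstract, the following is a minimal NumPy sketch of what each one is. It is not the paper's pipeline. The function names (`rigid_2d`, `skew`, `fundamental_from_pose`), the intrinsics `K`, and the stereo pose `R`, `t` are illustrative assumptions. A 2D rigid transformation is parameterized here by a rotation angle and a translation, and a fundamental matrix is built from a known camera pose using the standard relation F = K2^{-T} [t]_x R K1^{-1}, so that corresponding pixels satisfy x2^T F x1 = 0.

```python
import numpy as np

def rigid_2d(theta, tx, ty):
    """Homogeneous 3x3 matrix: rotate by theta (radians), then translate by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1}, assuming camera-2 coordinates X2 = R @ X1 + t."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

# Hypothetical pinhole intrinsics and a horizontal stereo baseline (assumed values).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 0.0, 0.0])
F = fundamental_from_pose(K, K, R, t)

# Project one 3D point into both views and evaluate the epipolar constraint.
X1 = np.array([0.3, -0.2, 4.0])
X2 = R @ X1 + t
x1 = K @ X1
x1 = x1 / x1[2]
x2 = K @ X2
x2 = x2 / x2[2]
residual = x2 @ F @ x1  # close to zero for a true correspondence

# A rigid 2D transform applied to a homogeneous image point.
M = rigid_2d(np.pi / 6, 10.0, -5.0)
p_moved = M @ np.array([100.0, 50.0, 1.0])
```

In a learned setting, a network would regress the parameters (theta, tx, ty) or the entries of F from an image pair. The epipolar residual above is one common way to check how well a predicted F agrees with known correspondences.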
