Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses

Ruxiao Duan, Erin Hong, Dongxu Zhao, Eric Turner, Alex Wong, Yunwen Zhou

Abstract

Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. Their error arises from changes in the spatial positions of pixels under a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limits generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, namely VGGT, $\pi^3$, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.

Figures (12)

  • Figure 1: Foundation models for 3D reconstruction generally fail on fisheye images. Top: ground-truth sparse LiDAR points of a fisheye image, alongside undistorted perspective images with varying FoVs. Bottom: point clouds reconstructed from the corresponding images by MapAnything [keetha2025mapanything]; the leftmost one is calibrated by our method. Naively undistorting the fisheye image for 3D reconstruction either sacrifices scene coverage or introduces extreme resampling artifacts and peripheral stretching that lie far outside the model's training distribution, as evident from the mismatch between predicted and actual FoVs.
  • Figure 2: Fisheye adaptation framework with perspective training data. Our method leverages existing perspective datasets by applying synthetic curvilinear distortion to the input sequences (see the distortion-synthesis sketch after this figure list). The distorted images are processed by the frozen transformer backbone, where learnable calibration tokens are injected into the encoder layers to align fisheye features with the model's internal perspective manifold. The decoder predicts dense geometric attributes, e.g., ray directions and ray depth, which are undistorted for loss computation in the optimization of calibration tokens. Supervision is derived either from ground-truth labels or through self-supervised distillation of the model's own predictions on the original undistorted perspective sequence.
  • Figure 3: Masked attention for mixture sequences. To handle a sequence of mixed camera types, we introduce a masked attention mechanism that blocks the influence of calibration tokens on perspective image tokens (see the mask-construction sketch after this figure list). After $L_0$ layers of feature extraction in the image encoder, the class tokens are passed to a linear classifier for camera-type prediction. Frame-wise and global attention masks are then created accordingly and used in all remaining encoder layers: the final $L_1 - L_0$ layers of the image encoder, and the $L_2$ frame and $L_2$ global attention layers in the alternating attention.
  • Figure 4: Qualitative comparison of FoV, depth, and point map estimation on an indoor (top) and outdoor (bottom) fisheye sequence. We apply calibration tokens to MapAnything, which consistently produces more geometrically consistent reconstructions than the baseline models. The yellow arrows highlight regions where our calibration tokens effectively rectify significant geometric distortions.
  • Figure 5: Performance analysis on hybrid sequences of ScanNet++ with varying perspective image ratios. Pre-trained: the vanilla MapAnything backbone without adaptation. C.T. (Fisheye Training): calibration tokens trained on fully fisheye sequences. C.T. (Mix Training): tokens trained on mixture sequences of both camera types. C.T. (Mix Training + M.A.): masked attention introduced during mixed training.
  • ...and 7 more figures
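
As referenced in the Figure 2 caption, the sketch below illustrates one plausible way to synthesize curvilinear distortion on a perspective image, assuming the equidistant fisheye model $r = f\theta$ (versus $r = f\tan\theta$ for perspective projection). The projection model, the 120° default FoV, and the function name are illustrative assumptions, not the paper's actual synthesis pipeline.

```python
# Minimal sketch: warp a perspective image into an equidistant fisheye
# rendering, normalized so the image corners stay fixed. Assumes fov < 180°.
import cv2
import numpy as np

def distort_to_fisheye(img: np.ndarray, fov_deg: float = 120.0) -> np.ndarray:
    h, w = img.shape[:2]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    half_fov = np.deg2rad(fov_deg) / 2.0
    r_max = np.hypot(cx, cy)                 # half-diagonal in pixels
    f_fish = r_max / half_fov                # equidistant:  r = f * theta
    f_persp = r_max / np.tan(half_fov)       # perspective:  r = f * tan(theta)

    # For every output (fisheye) pixel, find the source perspective pixel.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    r_fish = np.hypot(xs - cx, ys - cy)
    theta = np.minimum(r_fish / f_fish, half_fov)   # incidence angle (clamped
                                                    # for numerical safety)
    r_persp = f_persp * np.tan(theta)
    scale = np.where(r_fish > 0, r_persp / np.maximum(r_fish, 1e-8), 1.0)
    map_x = (cx + (xs - cx) * scale).astype(np.float32)
    map_y = (cy + (ys - cy) * scale).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

Under this normalization, magnification is highest at the image center and falls off toward the periphery, reproducing the barrel-distortion appearance of fisheye captures.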
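
As referenced in the Figure 3 caption, the sketch below shows one way the frame-wise or global attention mask could be constructed so that perspective image tokens never attend to calibration tokens. All names here are hypothetical; only the masking rule is taken from the caption.

```python
# Minimal sketch: boolean attention mask over a mixed token sequence.
# True = attention allowed. Perspective image tokens (queries) are blocked
# from attending to calibration tokens (keys).
import torch

def build_attention_mask(is_calib: torch.Tensor,
                         frame_id: torch.Tensor,
                         is_persp_frame: torch.Tensor) -> torch.Tensor:
    """is_calib:       (N,) bool, True for calibration tokens.
    frame_id:          (N,) long, frame index of each token.
    is_persp_frame:    (F,) bool, predicted camera type per frame."""
    is_persp_tok = is_persp_frame[frame_id]          # (N,) perspective tokens
    # Query i may attend to key j unless i is a perspective image token
    # and j is a calibration token.
    blocked = is_persp_tok[:, None] & is_calib[None, :]
    return ~blocked                                  # (N, N)
```

The returned mask can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, whose boolean convention likewise treats `True` as "may attend".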