Convolution kernel adaptation to calibrated fisheye

Bruno Berenguel-Baeta, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Jose J. Guerrero

TL;DR

This work tackles the domain gap between perspective CNNs and fisheye imagery by introducing camera-calibrated deformable convolutions based on the Kannala-Brandt projection. Kernels are warped according to calibration-derived offsets, enabling receptive fields that match radial distortion and allowing transfer from perspective-trained networks with limited fine-tuning. The authors implement calibrated kernels for radially distorted fisheye cameras and demonstrate improvements in monocular depth estimation and semantic segmentation over standard convolutions, especially at larger distortion radii. This approach facilitates leveraging large perspective datasets for fisheye-enabled scene understanding without requiring massive new datasets for each camera calibration, with potential extension to other projection models.

Abstract

Convolution kernels are the basic structural component of convolutional neural networks (CNNs). In recent years there has been growing interest in fisheye cameras for many applications. However, the radially symmetric projection model of these cameras produces strong distortions that degrade the performance of CNNs, especially when the field of view is very large. In this work, we tackle this problem by proposing a method that leverages the camera calibration to deform the convolution kernel and adapt it to the distortion. That way, the receptive field of the convolution is similar to that of standard convolutions on perspective images, allowing us to take advantage of networks pre-trained on large perspective datasets. We show that, with just a brief fine-tuning stage on a small dataset, the network's performance on the calibrated fisheye improves over standard convolutions in depth estimation and semantic segmentation.
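To make the idea of calibration-derived kernel offsets concrete, here is a minimal NumPy sketch. It is not the authors' implementation: this summary does not give their exact construction, so the sketch assumes one plausible reading, in which a regular kernel grid is laid on the plane tangent to the viewing ray of each pixel and then projected into the fisheye image. It also assumes the Kannala-Brandt radial model in its common odd-polynomial form, r(θ) = k1·θ + k2·θ³ + ..., with coefficients in pixels. The function names (`kb_radius`, `kb_theta`, `kernel_offsets`) and the calibration values in the usage note are hypothetical.

```python
import numpy as np

def kb_radius(theta, k):
    """Assumed Kannala-Brandt radial model: r = k[0]*theta + k[1]*theta**3 + ..."""
    theta = np.asarray(theta, dtype=float)
    return sum(ki * theta ** (2 * i + 1) for i, ki in enumerate(k))

def kb_theta(r, k, iters=20):
    """Invert r(theta) with Newton's method (assumes r is monotonic over the FOV)."""
    theta = np.asarray(r, dtype=float) / k[0]  # initial guess from the linear term
    for _ in range(iters):
        f = kb_radius(theta, k) - r
        df = sum((2 * i + 1) * ki * theta ** (2 * i) for i, ki in enumerate(k))
        theta = theta - f / df
    return theta

def kernel_offsets(pixel, k, center, ksize=3):
    """Return (ksize, ksize, 2) offsets (dx, dy) relative to a regular pixel grid.

    Assumption: the kernel is a regular grid on the plane tangent to the unit
    sphere at the pixel's viewing ray, with spacing of one central pixel.
    """
    cx, cy = center
    dx, dy = pixel[0] - cx, pixel[1] - cy
    theta = kb_theta(np.hypot(dx, dy), k)  # angle between ray and optical axis
    phi = np.arctan2(dy, dx)               # azimuth around the optical axis
    st, ct, sp, cp = np.sin(theta), np.cos(theta), np.sin(phi), np.cos(phi)
    ray = np.array([st * cp, st * sp, ct])
    # Tangent-plane basis, rotated so it matches the image x/y axes at the center.
    e_rad = np.array([ct * cp, ct * sp, -st])
    e_tan = np.array([-sp, cp, 0.0])
    e_u = cp * e_rad - sp * e_tan
    e_v = sp * e_rad + cp * e_tan
    step = 1.0 / k[0]                      # about one pixel at the image center
    half = ksize // 2
    idx = np.arange(-half, half + 1)
    jj, ii = np.meshgrid(idx, idx, indexing="ij")  # jj: rows (y), ii: columns (x)
    pts = ray + step * (ii[..., None] * e_u + jj[..., None] * e_v)
    # Project the tangent-plane samples back into the fisheye image.
    th = np.arctan2(np.hypot(pts[..., 0], pts[..., 1]), pts[..., 2])
    ph = np.arctan2(pts[..., 1], pts[..., 0])
    r = kb_radius(th, k)
    projected = np.stack([cx + r * np.cos(ph), cy + r * np.sin(ph)], axis=-1)
    regular = np.stack([pixel[0] + ii, pixel[1] + jj], axis=-1)
    return projected - regular
```

With a made-up equidistant calibration such as `k = [300.0, 0.0, 0.0, 0.0]` and `center = (512.0, 512.0)`, the offsets are close to zero at the optical center and grow toward the image border, which is the behavior a calibrated kernel needs to keep its receptive field similar to the perspective case. In a network, offsets of this kind would be precomputed once per calibration and passed to a deformable-convolution layer in place of learned offsets.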

Paper Structure

This paper contains 10 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of how a standard convolution is deformed by the Kannala-Brandt projection model in a fisheye image. a) Several convolutional kernels adapted to the calibrated fisheye image. b) How the convolutional kernels are computed with the Kannala-Brandt projection model.
  • Figure 2: Comparison of depth estimation results with a U-Net neural network using standard (red) and calibrated (blue) convolutions. The x-axis shows the distance of the pixels to the optical center and the y-axis the computed error, reported as the mean with one standard deviation.
  • Figure 3: Qualitative results of monocular depth estimation on different fisheye calibrations. Distance is shown on a color scale, from colder colors (closer distances) to warmer colors (farther distances).
  • Figure 4: Qualitative results of depth estimation for a FOV of 195°: top view of a 3D point cloud generated from depth data.
  • Figure 5: Comparison of semantic segmentation results with a U-Net-like network using standard (red) and calibrated (blue) convolutions. The x-axis shows the distance of the pixels to the optical center and the y-axis the computed error, reported as the mean with one standard deviation.
  • ...and 1 more figures