
InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset

Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen

Abstract

Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features, such as those from DINOv3, and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach recovers absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model generalizes to real-world cabin environments without requiring identical camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.

Figures (10)

  • Figure 1: Our InCaRPose predicts the relative camera pose between a reference and a target view (shown as camera frustums). Trained exclusively on synthetic images, the model generalizes to real-world cabin environments and enables camera extrinsic calibration.
  • Figure 2: Camera coordinate system of the standard view compared to the vehicle's coordinate system.
  • Figure 3: Standard view comparison: (left) real-world image and (right) synthetic image from the simulation environment.
  • Figure 4: InCaRPose's architecture overview. Two images are encoded by a frozen ViT backbone and fused by a cross-attention Transformer decoder. A prediction head outputs the relative camera pose between the views, optionally in both directions.
  • Figure 5: Qualitative results on real-world inference. All translation vectors are normalized to a common scale for visualization.
  • ...and 5 more figures