Blind Augmentation: Calibration-free Camera Distortion Model Estimation for Real-time Mixed-reality Consistency

Siddhant Prakash, David R. Walton, Rafael K. dos Anjos, Anthony Steed, Tobias Ritschel

TL;DR

The paper tackles the challenge of achieving visual consistency between real camera footage and virtual content in real-time mixed-reality without requiring camera calibration or markers. It introduces blind augmentation, which jointly learns a distortion model for noise, motion blur, and depth-of-field from arbitrary videos, then uses this model to synthesize corresponding distortions for virtual objects in real time. The approach combines depth and motion estimation with lightweight, end-to-end optimization to recover parameters such as $\lambda$, $\delta$, and $\sigma$, and then applies MB, DoF, and noise using off-the-shelf renderers, enabling fast startup and robust AR compositing. Extensive qualitative, quantitative, user-study, and real-time demonstrations (including a Unity demo on a Meta Quest 3) show that the method matches or exceeds marker-based baselines without requiring prior calibration, while maintaining practical runtime. This work offers a practical path to calibration-free, high-fidelity AR alignment in consumer MR devices.

Abstract

Real camera footage is subject to noise, motion blur (MB) and depth of field (DoF). In some applications these might be considered distortions to be removed, but in others it is important to model them because it would be ineffective, or interfere with an aesthetic choice, to simply remove them. In augmented reality applications where virtual content is composed into a live video feed, we can model noise, MB and DoF to make the virtual content visually consistent with the video. Existing methods for this typically suffer from two main limitations. First, they require a camera calibration step to relate a known calibration target to the specific camera's response. Second, existing work requires methods that can be (differentiably) tuned to the calibration, such as slow and specialized neural networks. We propose a method which estimates parameters for noise, MB and DoF instantly, which allows using off-the-shelf real-time simulation methods from, e.g., a game engine when compositing augmented content. Our main idea is to unlock both features by showing how to use modern computer vision methods that can remove noise, MB and DoF from the video stream, essentially providing self-calibration. This allows auto-tuning any black-box real-time noise+MB+DoF method to deliver fast and high-fidelity augmentation consistency.
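
The self-calibration idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a 1D signal instead of video frames, a box filter as a stand-in for MB/DoF blur, and a grid search in place of the gradient-based optimization the paper describes. The names `box_blur` and `estimate_distortion` are hypothetical. Given an observed signal and a restored (sharp, noise-free) version of it, the sketch picks the blur width whose re-synthesis best matches the observation, then reads the noise level off the remaining residual.

```python
import random

def box_blur(signal, width):
    """Blur a 1D signal with a centered box filter of odd `width` (edges clamped)."""
    half = width // 2
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(i + k, 0), n - 1)] for k in range(-half, half + 1)]
        out.append(sum(window) / len(window))
    return out

def estimate_distortion(observed, restored, candidate_widths):
    """Pick the blur width whose re-synthesis best matches `observed`,
    then estimate the noise level from the remaining residual."""
    best_width, best_err = None, float("inf")
    for width in candidate_widths:
        resynth = box_blur(restored, width)
        # Mean squared error between the re-synthesis and the observation.
        err = sum((o - r) ** 2 for o, r in zip(observed, resynth)) / len(observed)
        if err < best_err:
            best_width, best_err = width, err
    noise_std = best_err ** 0.5  # RMS of what the blur alone cannot explain
    return best_width, noise_std

# Assumed stand-in for the output of an off-the-shelf restoration method:
# a sharp, noise-free signal with a few step edges.
rng = random.Random(0)
restored = [1.0 if (i // 16) % 2 else 0.0 for i in range(256)]

# Synthetic "camera" signal: blur of width 7 plus Gaussian noise with std 0.05.
observed = [v + rng.gauss(0.0, 0.05) for v in box_blur(restored, 7)]

width, noise_std = estimate_distortion(observed, restored, [1, 3, 5, 7, 9, 11])
print(width, round(noise_std, 3))
```

Once estimated, the same two parameters could be applied to a clean virtual signal with `box_blur` and additive Gaussian noise, mirroring how the paper applies recovered parameters to virtual content through an off-the-shelf renderer.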

Paper Structure

This paper contains 29 sections, 6 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Overview of our approach, comprising a training part (top half) and a test or deployment phase (bottom half). Training starts with the input image $I$ at the top left, which is fed into off-the-shelf depth and flow extractors, as well as off-the-shelf methods to remove noise, MB and DoF. These off-the-shelf processes are denoted by black arrows. Next, noise, MB and DoF are re-synthesized, and the difference between this re-synthesis and the input image is computed (orange arrows). This error is minimized by back-propagating to the noise, MB and DoF parameters (blue arrows). This forms a model that knows the noise profile and how to blur for which depth or which motion (top right). At test time, we know flow and depth of a virtual RGB image, and hence can synthesize noise, MB and DoF (pink arrows) using off-the-shelf and fast methods, before composing a final image with consistency superior to no noise, MB and DoF.
  • Figure 2: Testing correctness of our approach by estimating camera parameters for different conditions. The first three columns are three different input images. In each column, this input is subject to increasingly strong distortion. The distortion is MB with falling balls in the first row (varying exposure), noise in the second row (varying exposure), and DoF in the third row (varying focus distance). Now, the last three columns are results of our algorithm, adding the right distortion, at the right extent to the same virtual content augmenting the input. We see that the virtual object looks different for different levels of distortion. We also see that the level of distortion is consistent with the input when composed. Please see the supplemental PDF for a discussion on the parameters recovered and the supplemental video to see the distortions in motion.
  • Figure 3: Applying our method to augment very similar scenes with the same virtual objects captured with entirely different cameras. We show original input images (top) for each augmented image (bottom). We see that, while distortions differ, our method consistently transfers them all with plausible settings to the virtual object. Note that the scene framing differs, as the aspect and optics of these cameras differ. Also note that differences in color reproduction (radiometric calibration) are, besides the chromatic variance of the noise, not the aim of this work.
  • Figure 4: Results of our Identification and our Consistency study on the top and bottom, respectively.
  • Figure 5: Recovered parameters using Optimization (top) and FullPipeline (bottom) for MB (left) and DoF (right). For MB, ground truth parameters are shown in blue and recovered parameters are shown in orange. For DoF, ground truth models $G(Z)$ are shown with solid lines and recovered models $\hat{G}(Z)$ are shown with dotted lines. The colors denote different models of DoF. We observe good correlation between ground truth and recovered parameters in both experiments.
  • ...and 11 more figures