Domain Generalization for 6D Pose Estimation Through NeRF-based Image Synthesis

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

TL;DR

The paper tackles domain shift in 6D pose estimation for spacecraft by proposing a NeRF-based data augmentation. An in-the-wild NeRF is trained on synthetic data and used to generate a set $S_{nerf}$ of images with diverse viewpoints, illumination (via appearance embeddings), and textures (via color perturbations); the pose estimator is then trained on $S_{train}=S_{synth}\cup S_{nerf}$. Experiments on SPEED+ show substantial improvements in target-domain pose accuracy, with reductions of up to $55\%$ and $45\%$ in angular and translation errors on Lightbox and Sunlamp, respectively, and ablations confirm the value of appearance extrapolation and texture randomization. The approach demonstrates that NeRF-synthesized data can enable robust pose estimation even when real data or CAD models are limited, offering a scalable path for domain-generalizable 6D pose estimation in space missions.

Abstract

This work introduces a novel augmentation method that increases the diversity of a train set to improve the generalization abilities of a 6D pose estimation network. For this purpose, a Neural Radiance Field is trained from synthetic images and exploited to generate an augmented set. Our method enriches the initial set by enabling the synthesis of images with (i) unseen viewpoints, (ii) rich illumination conditions through appearance extrapolation, and (iii) randomized textures. We validate our augmentation method on the challenging use-case of spacecraft pose estimation and show that it significantly improves the pose estimation generalization capabilities. On the SPEED+ dataset, our method reduces the error on the pose by 50% on both target domains.
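
The appearance extrapolation mentioned in point (ii) can be illustrated with a minimal sketch. It assumes that a new appearance embedding is obtained as a linear blend of two learned embeddings with a weight $\alpha$; the exact formula is not given here, so the blend, the function names, and the embedding size are hypothetical. The range $[-4, 4]$ is the one quoted in the Figure 5 caption.

```python
import random

def blend_appearance(e_a, e_b, alpha):
    """Assumed linear blend of two appearance embeddings.

    alpha in [0, 1] interpolates between e_a and e_b;
    alpha outside [0, 1] extrapolates beyond them.
    """
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]

def sample_appearance(embeddings, rng, alpha_range=(-4.0, 4.0)):
    """Pick two distinct embeddings and blend them with a random weight."""
    e_a, e_b = rng.sample(embeddings, 2)
    alpha = rng.uniform(*alpha_range)
    return blend_appearance(e_a, e_b, alpha), alpha

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical learned embeddings: 3 training images, 4 dimensions each.
    learned = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    embedding, alpha = sample_appearance(learned, rng)
    print(alpha, embedding)
```

Restricting `alpha_range` to `(0.0, 1.0)` gives the interpolation-only setting that the extrapolated setting is compared against.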

Paper Structure (15 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Overview of our domain generalization method. Instead of training a pose estimation network on a synthetic set $S_{synth}$, which lacks diversity, we train it on an augmented set $S_{train}$ that combines the synthetic set and $S_{nerf}$, a set of images synthesized by a Neural Radiance Field (NeRF) \cite{mildenhall2021nerf} trained on synthetic images. Our proposal is shown to improve the accuracy of the pose estimation module on the real test set $S_{real}$. The diversity of the generated set, $S_{nerf}$, is ensured by randomizing the viewpoint distribution, the illumination conditions, and the target texture, as further explained in \ref{sec_nerf_augm}.
  • Figure 2: (Left) Two real images depicting the spacecraft used in the SPEED+ dataset \cite{park2022speed+}. (Right) Two synthetic images from SPEED+, generated using a simplified CAD model of the target spacecraft. The global shape of the spacecraft is correctly rendered, but the synthetic images fail to capture its texture. In addition, the synthetic images do not contain the adverse illumination conditions encountered in the real ones.
  • Figure 3: Overview of the synthesis of an image $I$ depicting an object under a camera pose ($q$, $t$) through an in-the-wild NeRF, using our augmentation method. As explained in \ref{sec_bacground_wild_nerfs}, the scene is represented by a field that maps the coordinates of any point to its density and color. By aggregating those values along rays passing through each pixel of the camera, the image can be generated. Our augmentation method leverages the NeRF to enable the generation of images (A) taken under novel viewpoints, (B) depicting rich illumination conditions, and (C) presenting diverse textures. This significantly improves the diversity of the generated set, which, in turn, enhances the generalization capabilities of a pose estimator trained on that set.
  • Figure 4: Examples of images generated by our NeRF-based augmentation method. While the synthetic images on which the NeRF was trained only depict a texture-less target spacecraft exposed to smooth illumination conditions (see \ref{fig_mismatches_domain}), our images exhibit a much larger diversity in terms of both illumination conditions and texture. In addition, our method enables the synthesis of images taken from novel viewpoints. For a comprehensive visual demonstration of this diversity, see the supplementary video.
  • Figure 5: Images generated through appearance interpolation/extrapolation. Each line depicts the same target, under the same pose, but using different weights $\alpha$ for the extrapolation of the appearance embedding. The illumination conditions are more diverse when the appearance is extrapolated ($\alpha \in [-4, 4]$) rather than only interpolated ($\alpha \in [0, 1]$). See the supplementary video for an example of a sequence generated through a progressive extrapolation between two appearance embeddings.
  • ...and 2 more figures
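
The Figure 3 caption describes rendering as aggregating density and color along the ray through each pixel. A minimal sketch of that step is given below, assuming the standard NeRF quadrature (per-sample opacity $1 - e^{-\sigma\delta}$ weighted by accumulated transmittance); the in-the-wild NeRF used in the paper may add components not shown here, and the function name and sample values are hypothetical.

```python
import math

def composite_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered from near to far.

    densities: volume density sigma at each sample
    colors:    RGB triplet at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the remaining transmittance.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

if __name__ == "__main__":
    # Hypothetical ray: empty space, then a semi-transparent red sample,
    # then a dense blue sample behind it.
    pixel, remaining = composite_ray(
        densities=[0.0, math.log(2.0), 50.0],
        colors=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
        deltas=[1.0, 1.0, 1.0],
    )
    print(pixel, remaining)
```

Repeating this for every pixel of a camera placed at pose ($q$, $t$) yields one synthesized image; changing the pose, the appearance embedding, or the predicted colors changes the viewpoint, illumination, or texture of that image respectively.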