SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations

Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

TL;DR

SenseShift6D tackles the real-world robustness problem in 6D pose estimation by introducing a physically captured RGB-D benchmark that orthogonally varies exposure, gain, depth-capture mode, and illumination. The study shows that test-time multimodal sensor control yields substantial accuracy gains for both pretrained generalizable models and instance-level estimators, outperforming fixed configurations without retraining and reducing performance disparities across scenes. By revealing sensor variation as a distinct axis of generalization, the work provides a foundation for sensor-aware robustness and adaptive perception systems. Overall, SenseShift6D enables systematic evaluation and development of self-tuning perception under environmental and hardware variability.

Abstract

Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-LESS. However, these datasets were captured under fixed illumination and camera settings, leaving largely unexplored both the impact of real-world variations in illumination, exposure, gain, or depth-sensor mode and the potential of test-time sensor control to mitigate such variations. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For five common household objects (spray, pringles, tincase, sandwich, and mouse), we acquire 166.4k RGB and 16.7k depth images, which provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset demonstrate that applying multimodal sensor control at test time yields substantial performance gains, achieving a 19.5 pp improvement on pretrained generalizable models. It also enhances robustness precisely where those models tend to fail. Moreover, even for instance-level pose estimators, where the train and test sets share identical objects and backgrounds, performance still varies under environmental and sensor changes. This demonstrates that test-time sensor control remains effective, without any additional training, when compared with costly expansions in the quantity and diversity of real-world training data. SenseShift6D extends the object pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments.
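
The contrast the abstract draws between fixed sensor configurations and test-time sensor control can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the scene names, the configuration tuples, the accuracy values, and the helper functions `baseline`, `best_fixed`, and `best_per_scene` are hypothetical and do not reproduce the paper's data or its exact selection protocol.

```python
from statistics import mean

# Hypothetical accuracy scores (NOT from the paper), indexed by scene and by
# an illustrative (exposure, gain, depth_mode) sensor configuration.
DEFAULT = ("auto", "auto", "default")
CONFIG_B = ("long", "high", "high_accuracy")
CONFIG_C = ("short", "low", "high_density")

scores = {
    "scene_dark": {DEFAULT: 0.40, CONFIG_B: 0.70, CONFIG_C: 0.20},
    "scene_bright": {DEFAULT: 0.60, CONFIG_B: 0.50, CONFIG_C: 0.80},
}


def baseline(scores, default_config):
    """Mean score when every scene uses the same default configuration."""
    return mean(per_config[default_config] for per_config in scores.values())


def best_fixed(scores):
    """Pick the single configuration with the highest mean score over all scenes."""
    configs = next(iter(scores.values())).keys()
    return max(
        ((cfg, mean(per_config[cfg] for per_config in scores.values()))
         for cfg in configs),
        key=lambda pair: pair[1],
    )


def best_per_scene(scores):
    """Mean score when each scene independently uses its own best configuration."""
    return mean(max(per_config.values()) for per_config in scores.values())


if __name__ == "__main__":
    print("default config everywhere:", baseline(scores, DEFAULT))
    print("best single config:", best_fixed(scores))
    print("best config per scene:", best_per_scene(scores))
```

With these made-up numbers, choosing a configuration per scene scores higher than any single fixed configuration, which in turn scores higher than the default. This is the kind of gap a benchmark with per-pose sensor sweeps makes measurable.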

Paper Structure

This paper contains 20 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: RGB sample images under auto-exposure and under all combinations of exposure and gain settings at brightness levels of 5%, 25%, 75%, and 100%. Rows indicate gain levels and columns indicate exposure levels.
  • Figure 2: Depth sample images under four depth capture modes: default, high accuracy, high density, and medium density.
  • Figure 3: Overall standard deviation of AUC, computed across all brightness levels and scenes, under the Baseline, Oracle-Fixed, and Oracle-Dynamic settings.
  • Figure 4: Examples of RGB augmentations applied during training. The top-left image is the original, and the remaining images are randomly augmented samples generated using the settings summarized in the augmentation table (tab:aug_list).
  • Figure 5: Comparison of predictions under Baseline and Oracle for GigaPose. Visualized object pose on RGB images: ground truth pose in red, predicted pose in green.
  • ...and 3 more figures