Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation

Han Xue, Nan Min, Xiaotong Liu, Wendi Chen, Yuan Fang, Jun Lv, Cewu Lu, Chuan Wen

TL;DR

This first comprehensive empirical study of wrist-mounted fisheye cameras for imitation learning finds that the wide FoV significantly enhances spatial localization, but that this benefit depends critically on the visual complexity of the environment.

Abstract

The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions concerning spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available at https://robo-fisheye.github.io/

Paper Structure

This paper contains 37 sections, 20 figures, and 12 tables.

Figures (20)

  • Figure 1: Overview of the four factors analyzed to address our Research Questions (RQs). We study: (a) Camera Model (fisheye vs. pinhole) as our primary comparison; (b) Scene Complexity (poor vs. rich) for spatial localization (RQ1); (c) Scene Diversity (1 vs. N scenes) for scene generalization (RQ2); and (d) Camera Parameters (varied intrinsics) for hardware generalization (RQ3).
  • Figure 2: The implementation pipeline of fisheye camera simulation in MuJoCo (Todorov et al., 2012).
  • Figure 3: Random Crop Augmentation (fixed scale) vs. Random Scale Augmentation (RSA) for cross-camera generalization.
  • Figure 4: The six tasks in simulation experiments.
  • Figure 5: (a) The real-world experiment setup, which includes changeable backgrounds for scene complexity (RQ1) and scene generalization (RQ2) experiments. (b) The three tasks in real-world experiments: Pick Cup, Fold Towel and Hang Chinese Knot.
  • ...and 15 more figures
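
The contrast in Figure 3 between random cropping at a fixed scale and Random Scale Augmentation (RSA) can be illustrated with a short sketch. The paper's exact RSA procedure is not given in this section, so everything below is an assumption: RSA is taken to mean sampling a scale factor per image, cropping the centered region of that relative size, and resizing it back to the original resolution so that the apparent object scale varies during training. The function names, the `scale_range` default, and the nearest-neighbor resize are all hypothetical choices made only to keep the example self-contained.

```python
import numpy as np


def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor resize using integer index maps (NumPy only)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return img[rows[:, None], cols]


def random_crop_fixed_scale(img: np.ndarray, rng: np.random.Generator,
                            scale: float = 0.9) -> np.ndarray:
    """Baseline: crop a window of fixed relative size at a random position."""
    h, w = img.shape[:2]
    ch, cw = int(round(h * scale)), int(round(w * scale))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return img[top:top + ch, left:left + cw]


def random_scale_augment(img: np.ndarray, rng: np.random.Generator,
                         scale_range=(0.7, 1.0)) -> np.ndarray:
    """Assumed RSA: random-size center crop, resized back to the input size.

    scale_range is a hypothetical default, not a value from the paper.
    """
    h, w = img.shape[:2]
    s = rng.uniform(*scale_range)  # the crop scale itself is randomized
    ch, cw = max(1, int(round(h * s))), max(1, int(round(w * s)))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    return resize_nearest(crop, h, w)
```

In this sketch the fixed-scale crop changes only where the window sits, whereas the assumed RSA changes how large the scene appears in the output image, which is the property the abstract associates with reducing scale overfitting across cameras.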