
Sim-to-Real Fruit Detection Using Synthetic Data: Quantitative Evaluation and Embedded Deployment with Isaac Sim

Martina Hutter-Mironovova

Abstract

This study investigates the effectiveness of synthetic data for sim-to-real transfer in object detection under constrained data conditions and embedded deployment requirements. Synthetic datasets were generated in NVIDIA Isaac Sim and combined with limited real-world fruit images to train YOLO-based detection models under real-only, synthetic-only, and hybrid regimes. Performance was evaluated on two test datasets: an in-domain dataset with conditions matching the training data and a domain shift dataset containing real fruit under different background conditions. Results show that models trained exclusively on real data achieve the highest accuracy, while synthetic-only models exhibit reduced performance due to a domain gap. Hybrid training strategies significantly improve performance compared to synthetic-only approaches and achieve results close to real-only training while reducing the need for manual annotation. Under domain shift conditions, all models show performance degradation, with hybrid models providing improved robustness. The trained models were successfully deployed on a Jetson Orin NX using TensorRT optimization, achieving real-time inference performance. The findings highlight that synthetic data is most effective when used in combination with real data and that deployment constraints must be considered alongside detection accuracy.
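The three training regimes compared in the abstract can be sketched as simple dataset compositions. The following is a minimal illustrative sketch, not the paper's actual pipeline; the file names, counts, and the helper `build_regime` are assumptions, with the counts loosely modeled on the H1000+50 configuration mentioned in the figure captions (1000 synthetic plus 50 real images).

```python
# Illustrative sketch of the real-only, synthetic-only, and hybrid
# training regimes. All names and counts here are hypothetical.

def build_regime(synthetic, real, regime):
    """Return the training pool for a given regime.

    synthetic, real: lists of (image_path, label_path) pairs.
    regime: 'synthetic-only', 'real-only', or 'hybrid'.
    """
    if regime == "synthetic-only":
        return list(synthetic)
    if regime == "real-only":
        return list(real)
    if regime == "hybrid":
        # Hybrid training augments the synthetic pool with a small
        # set of annotated real images, reducing manual labeling effort.
        return list(synthetic) + list(real)
    raise ValueError(f"unknown regime: {regime}")

# Hypothetical splits: 1000 synthetic images, 50 real images.
synthetic = [(f"sim_{i}.png", f"sim_{i}.txt") for i in range(1000)]
real = [(f"real_{i}.png", f"real_{i}.txt") for i in range(50)]

hybrid = build_regime(synthetic, real, "hybrid")
print(len(hybrid))  # 1050
```

Under this protocol, every regime trains a model on its own pool while evaluation always uses the same fixed real-world test sets (T1 and T2 in the figures), so differences in accuracy are attributable to the training data alone.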

Paper Structure

This paper contains 31 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison between synthetic and real-world data. (a) Example image generated in NVIDIA Isaac Sim. (b) Real-world image from the train dataset. Differences in texture, lighting, and object appearance illustrate the domain gap between simulated and real environments.
  • Figure 2: General architecture of a YOLO-based object detection network. The model consists of a backbone for feature extraction, a neck for multi-scale feature aggregation, and a detection head that predicts bounding boxes and class probabilities in a single forward pass. Adapted from Redmon and Farhadi (2018).
  • Figure 3: Overview of the evaluated training regimes and common evaluation protocol. Synthetic-only, real-only, and hybrid models were trained using different combinations of synthetic and real training images, while all models were evaluated on the same fixed real-world test set.
  • Figure 4: (a) Detection performance (mAP@0.5) across different training regimes for in-domain (T1) and domain shift (T2) test datasets. The results show a significant performance drop under domain shift, while hybrid training improves robustness. (b) Detection performance (mAP@0.5:0.95) across different training regimes for in-domain (T1) and domain shift (T2) test datasets. The results confirm reduced performance under domain shift and highlight the benefit of hybrid training strategies.
  • Figure 5: Qualitative detection results under different training regimes. (a) In-domain example using hybrid training (H1000+50) evaluated on T1-100. (b) Domain shift example using the same model evaluated on T2-100. (c) Failure case of a synthetic-only model (S1000) under domain shift. (d) Improved detection on the same scene using hybrid training (H1000+50).