A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance

Eran Bamani, Eden Nissinman, Lisa Koenigsberg, Inbar Meir, Avishai Sintov

TL;DR

This paper proposes Diffusion in Ultra-Range (DUR), a framework based on a diffusion model that generates labeled images of distant objects in various scenes. Compared with other types of generative models, DUR shows superior fidelity and a higher recognition success rate when used to train an Ultra-Range Gesture Recognition (URGR) model.

Abstract

Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction, where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize barely visible objects at ultra-range requires an exhaustive collection of a large number of labeled samples. Generating synthetic training datasets is a recent solution to the lack of real-world data, yet it cannot properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework, based on a diffusion model, to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures, in which fine details of the gesturing hand are challenging to distinguish. Compared with other types of generative models, DUR shows superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based direction of a ground robot.
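The generator interface described in the abstract (a desired distance and class go in, a labeled synthetic image comes out) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `placeholder_generator`, `LabeledSample`, `build_synthetic_dataset`, and the 1 m lower bound on sampled distances are hypothetical. The gesture names are taken from the Figure 5 caption and the 25 m upper bound from the abstract.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Gesture classes as named in the Figure 5 caption.
GESTURES = ["pointing", "thumbs-up", "thumbs-down", "beckoning", "stop", "null"]
MAX_DISTANCE_M = 25.0  # upper range stated in the abstract
MIN_DISTANCE_M = 1.0   # ASSUMPTION: the lower bound is not given in this listing

Image = List[List[float]]


@dataclass
class LabeledSample:
    image: Image       # placeholder pixel grid
    gesture: str       # class label passed to the generator
    distance_m: float  # distance label passed to the generator


def placeholder_generator(gesture: str, distance_m: float,
                          rng: random.Random, size: int = 8) -> Image:
    """HYPOTHETICAL stand-in for a trained conditional generator.

    It ignores its conditions and returns random pixels; a real generator
    would synthesize an image matching the gesture and distance.
    """
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def build_synthetic_dataset(
    generator: Callable[[str, float, random.Random], Image],
    samples_per_class: int,
    seed: int = 0,
) -> List[LabeledSample]:
    """Request images for sampled (gesture, distance) pairs and keep the labels."""
    rng = random.Random(seed)
    dataset = []
    for gesture in GESTURES:
        for _ in range(samples_per_class):
            distance = rng.uniform(MIN_DISTANCE_M, MAX_DISTANCE_M)
            image = generator(gesture, distance, rng)
            # The label is free: it is exactly the condition that was requested.
            dataset.append(LabeledSample(image, gesture, distance))
    return dataset
```

The point of the sketch is that labels come for free: each synthetic image inherits the conditions used to request it, so no manual annotation step is needed before training a recognition model on the result.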

Paper Structure (12 sections, 5 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Two examples of generating synthetic images in DUR with gesture and distance conditions. Across 150 iterations, Gaussian noise is transformed into a comprehensive high-fidelity image according to the desired labels.
  • Figure 2: Illustration of the Diffusion in Ultra-Range (DUR) framework. (a) Dataset $\mathcal{D}$ for training DUR is acquired by collecting labeled RGB images, cropping the user with YOLOv8, and improving image quality with HQ-Net. (b) The generation of conditional synthetic samples starts with Gaussian noise along with distance and gesture conditions. A non-Markovian diffusion process denoises the noisy image and the ResNet filter removes failed images.
  • Figure 3: Examples of failed synthetic images generated by DUR.
  • Figure 4: Visual comparison of synthetic image examples across several generative models, including DUR, and for the trained gesture classes.
  • Figure 5: Examples of correct gesture recognition with GViT trained on synthetic data from DUR (left to right): thumbs-up gesture from 10 meters distance with model certainty of 95.6%; beckoning gesture from 12 meters distance with model certainty of 95.1%; null gesture from 19 meters distance with model certainty of 93.5%; pointing gesture from 15 meters distance with model certainty of 94.7%; thumbs-down gesture from 20 meters distance with model certainty of 92.8%; and stop gesture from 17 meters distance with model certainty of 93.9%.
  • ...and 4 more figures
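
The captions of Figures 1 and 2 describe generation as 150 iterations of a non-Markovian diffusion process that turns Gaussian noise into an image under distance and gesture conditions. A minimal sketch of such a loop is given below, assuming a deterministic DDIM-style update. The sampler variant, the noise schedule, and the network are not specified in this listing, so the linear beta schedule, `toy_target`, and the oracle noise predictor are hypothetical stand-ins chosen only so that the loop can be run and checked.

```python
import math
import random
from typing import Callable, List

NUM_STEPS = 150  # iteration count quoted in the Figure 1 caption

# (noisy image, step index, gesture id, distance in meters) -> predicted noise
EpsModel = Callable[[List[float], int, int, float], List[float]]


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4,
                    beta_end: float = 0.05) -> List[float]:
    """Cumulative products of (1 - beta_t). ASSUMED linear beta schedule.

    Index 0 corresponds to the clean image, so alpha_bars[0] == 1.0.
    """
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def toy_target(gesture_id: int, distance_m: float, num_pixels: int) -> List[float]:
    """HYPOTHETICAL 'image': a flat vector whose value depends on both conditions."""
    return [gesture_id + 1.0 / distance_m] * num_pixels


def make_oracle_eps_model(alpha_bars: List[float], num_pixels: int) -> EpsModel:
    """HYPOTHETICAL noise predictor that knows the clean target exactly.

    A trained network would replace this; the oracle only lets us verify
    that the update rule below converges to the conditioned target.
    """
    def eps_model(x_t, t, gesture_id, distance_m):
        x0 = toy_target(gesture_id, distance_m, num_pixels)
        a_t = alpha_bars[t]
        return [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
                for xi, x0i in zip(x_t, x0)]
    return eps_model


def ddim_sample(eps_model: EpsModel, alpha_bars: List[float], gesture_id: int,
                distance_m: float, num_pixels: int, seed: int = 0) -> List[float]:
    """Deterministic DDIM-style denoising from Gaussian noise down to step 0."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # start from pure noise
    for t in range(len(alpha_bars) - 1, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = eps_model(x, t, gesture_id, distance_m)
        # Estimate the clean image implied by the current noise prediction.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for xi, ei in zip(x, eps)]
        # Move to the previous noise level without injecting fresh noise.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_pred, eps)]
    return x
```

With the oracle predictor, the loop recovers the toy target for whichever gesture and distance are requested, which checks the update rule itself. In a real pipeline the oracle would be a trained conditional network, and a separate classifier (the ResNet filter mentioned in the Figure 2 caption) would discard failed outputs.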