A Dataset and Benchmarks for Deep Learning-Based Optical Microrobot Pose and Depth Perception

Lan Wei, Dandan Zhang

TL;DR

This work presents OTMR, the first public dataset tailored for microrobot pose and depth perception under optical microscopy, comprising 232,881 images across 18 robot designs and 176 out-of-plane poses. The authors benchmark eight deep-learning models, including Vision Transformers and NAS-optimised architectures, on pose classification and depth regression. ViT yields the best pose accuracy, deeper models improve depth estimation, and larger training sets improve performance on both tasks. The evaluation framework uses five-fold cross-validation, standard metrics, and transfer-learning studies across robot types, offering insights into model design and generalisation in microscale perception. The OTMR resource, together with interpretability analyses (Grad-CAM) and data-size experiments, supports development of robust, data-driven perception pipelines and closed-loop control for optical microrobots in challenging microenvironments.

Abstract

Optical microrobots, manipulated via optical tweezers (OT), have broad applications in biomedicine. However, reliable pose and depth perception remains a fundamental challenge due to the transparent or low-contrast nature of the microrobots, as well as the noisy and dynamic conditions of the microscale environments in which they operate. An open dataset is crucial for enabling reproducible research and accelerating the development of perception models tailored to microscale challenges, while standardised evaluation enables consistent, objective comparison across algorithms. Here, we introduce the OpTical MicroRobot dataset (OTMR), the first publicly available dataset designed to support microrobot perception under the optical microscope. OTMR contains 232,881 images spanning 18 microrobot types and 176 distinct poses. We benchmarked the performance of eight deep learning models, including architectures derived via neural architecture search (NAS), on two key tasks: pose classification and depth regression. Results indicate that the Vision Transformer (ViT) achieves the highest accuracy in pose classification, while depth regression benefits from deeper architectures. Additionally, increasing the size of the training dataset leads to substantial improvements across both tasks, highlighting OTMR's potential as a foundational resource for robust and generalisable microrobot perception in complex microscale environments.
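
The five-fold cross-validation mentioned in the TL;DR, applied to the two benchmark tasks, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the choice of accuracy for pose classification and mean absolute error for depth regression is an assumption, since the summary only refers to "standard metrics".

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def accuracy(pred_labels, true_labels):
    """Fraction of correct class predictions (assumed pose metric)."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)


def mean_absolute_error(pred_depths, true_depths):
    """Mean absolute depth error (assumed depth-regression metric)."""
    total = sum(abs(p - t) for p, t in zip(pred_depths, true_depths))
    return total / len(true_depths)


if __name__ == "__main__":
    for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5)):
        print(fold, len(train_idx), len(val_idx))
```

In such a scheme each image serves as validation data exactly once, and the per-fold metric values would typically be averaged to give one score per model.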

Paper Structure

This paper contains 20 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Conceptual overview of out-of-plane pose and depth estimation for optical microrobots. The pose is defined by pitch and roll angles, while depth corresponds to the vertical displacement relative to the microscope’s focal plane. Images are captured using an optical microscope and used to train deep learning models. These models learn to estimate the microrobot’s out-of-plane pose and depth from 2D microscopy images.
  • Figure 2: Overview of the 18 microrobot types included in the OTMR dataset. For each robot, the left image shows its CAD model, and the right image presents the corresponding experimental image captured at the focal plane under an optical microscope. Microrobots 1–6 (top row) are specifically designed for the pose classification task due to their varied and distinguishable orientations, while all 18 types are used for the depth estimation task.
  • Figure 3: Overview of the experimental platform for data collection.
  • Figure 4: Preprocessing pipeline for microrobot images. The original microscope image (left) is first processed through contour detection (middle) to identify the microrobot boundary. A $256 \times 256$ pixel region centred on the microrobot is then cropped (right) for subsequent analysis.
  • Figure 5: Illustration of pitch and roll angles in microrobot pose estimation. The top row shows variations in pitch angle (P), representing rotation around the horizontal axis, from $0^\circ$ to $70^\circ$. The bottom row shows variations in roll angle (R), representing rotation around the vertical axis, also from $0^\circ$ to $70^\circ$.
  • ...and 4 more figures
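
The cropping step described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: a simple intensity threshold and bounding box stand in for the contour detection used in the paper, the microrobot is assumed to appear darker than the background, and the function names are hypothetical. Images are plain nested lists of grey values so that the example needs no external libraries.

```python
def bounding_box_centre(image, threshold):
    """Return the (row, col) centre of the box enclosing all dark pixels.

    Stand-in for contour detection: pixels below `threshold` are treated
    as belonging to the microrobot (an assumption).
    """
    rows = [r for r, line in enumerate(image)
            if any(v < threshold for v in line)]
    cols = [c for line in image
            for c, v in enumerate(line) if v < threshold]
    if not rows:
        raise ValueError("no foreground pixels found")
    return (min(rows) + max(rows)) // 2, (min(cols) + max(cols)) // 2


def crop_centred(image, centre, size=256):
    """Crop a size x size window around `centre`, shifted to stay in bounds."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image dimensions")
    half = size // 2
    # Clamp the top-left corner so the window never leaves the image.
    top = min(max(centre[0] - half, 0), height - size)
    left = min(max(centre[1] - half, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


if __name__ == "__main__":
    # Tiny synthetic frame: bright background with two dark "robot" pixels.
    frame = [[255] * 10 for _ in range(8)]
    frame[2][3] = 0
    frame[4][5] = 0
    centre = bounding_box_centre(frame, threshold=128)
    patch = crop_centred(frame, centre, size=4)
    print(centre, len(patch), len(patch[0]))
```

Clamping the window rather than padding is one possible design choice; it keeps every crop at the full size even when the microrobot sits near the image border.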