Compact Task-Aligned Imitation Learning for Laboratory Automation

Kanata Suzuki, Hanon Nakamurama, Kana Miyamoto, Tetsuya Ogata

TL;DR

A compact imitation learning framework for laboratory automation using small foundation models that aligns a self-supervised vision foundation model with a vision-language model through a compact adapter, and integrates them with a Diffusion Transformer-based action expert.

Abstract

Robotic laboratory automation has traditionally relied on carefully engineered motion pipelines and task-specific hardware interfaces, resulting in high design cost and limited flexibility. While recent imitation learning techniques can generate general robot behaviors, their large model sizes often require high-performance computational resources, limiting applicability in practical laboratory environments. In this study, we propose a compact imitation learning framework for laboratory automation using small foundation models. The proposed method, TVF-DiT, aligns a self-supervised vision foundation model with a vision-language model through a compact adapter, and integrates them with a Diffusion Transformer-based action expert. The entire model consists of fewer than 500M parameters, enabling inference on low-VRAM GPUs. Experiments on three real-world laboratory tasks - test tube cleaning, test tube arrangement, and powder transfer - demonstrate an average success rate of 86.6%, significantly outperforming alternative lightweight baselines. Furthermore, detailed task prompts improve vision-language alignment and task performance. These results indicate that small foundation models, when properly aligned and integrated with diffusion-based policy learning, can effectively support practical laboratory automation with limited computational resources.
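
As a rough illustration of why a parameter count below 500M suits low-VRAM GPUs, the weight memory can be estimated from the parameter count and the numeric precision. This is a back-of-the-envelope sketch, not a measurement from the paper: the precisions listed are assumptions (the abstract does not state which one is used), the 500M figure is only the upper bound quoted above, and activations, diffusion sampling buffers, and framework overhead are ignored.

```python
# Weights-only memory estimate. The precisions below are illustrative
# assumptions; the abstract only gives the "< 500M parameters" bound.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


PARAM_UPPER_BOUND = 500_000_000  # upper bound stated in the abstract

for dtype in BYTES_PER_PARAM:
    gib = weight_memory_gib(PARAM_UPPER_BOUND, dtype)
    print(f"{dtype}: at most {gib:.2f} GiB for weights")
```

Under these assumptions the weights stay below 2 GiB even at full 32-bit precision, which leaves headroom for activations on GPUs with only a few GiB of memory.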

Paper Structure

This paper contains 20 sections, 1 equation, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of this study for a laboratory automation framework with a compact imitation learning model. We focus on robotic test tube manipulation under limited computational resources.
  • Figure 2: Architecture of the proposed TVF-DiT framework. DINOv3 and SigLIP2 extract geometric and language-aligned representations, which are fused via a lightweight Adapter. The resulting task-conditioned tokens are used as cross-attention keys and values in a Diffusion Transformer that predicts action chunks through conditional flow matching.
  • Figure 3: Task 1: Test tube cleaning requiring precise insertion and continuous scrubbing motion. Task 2: Test tube arrangement requiring collision avoidance and bimanual coordination. Task 3: Powder transfer requiring sequential scooping and pouring manipulation. These tasks evaluate geometric precision, object selection, and continuous action generation.
  • Figure 4: Representative execution sequences generated by the proposed method. Successful trials demonstrate continuous and coordinated manipulation across all tasks. Failure cases (highlighted) mainly occur during fine alignment or placement, indicating sensitivity to small geometric errors.
  • Figure 5: Joint trajectories and 3D end-effector (EE) paths during inference. Consistent EE paths across episodes demonstrate robustness of the learned policy.
  • ...and 1 more figure
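
The conditioning scheme described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the token counts, widths, and random projection matrices are hypothetical, the attention is single-head, the flow matching path is assumed to be the common linear interpolation between noise and data, and an oracle velocity field stands in for the trained Diffusion Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(action_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from the action tokens,
    keys and values come from the task-conditioned tokens."""
    q = action_tokens @ w_q                              # (T, D)
    k = cond_tokens @ w_k                                # (N, D)
    v = cond_tokens @ w_v                                # (N, D)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, N)
    return weights @ v, weights


def flow_matching_pair(actions, noise, t):
    """Assumed linear path: noisy sample x_t and its velocity target."""
    x_t = (1.0 - t) * noise + t * actions
    target_velocity = actions - noise
    return x_t, target_velocity


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Hypothetical sizes: chunk length, action dim, conditioning tokens, width.
T, A, N, D = 8, 7, 16, 32
actions = rng.normal(size=(T, A))       # stand-in for a demonstrated action chunk
noise = rng.normal(size=(T, A))
cond_tokens = rng.normal(size=(N, D))   # stand-in for the adapter output
w_in = rng.normal(size=(A, D)) * 0.1    # lifts actions to the model width
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# Training side: build the noisy chunk and let it attend to the conditioning.
x_t, target = flow_matching_pair(actions, noise, t=0.3)
attended, weights = cross_attention(x_t @ w_in, cond_tokens, w_q, w_k, w_v)

# Inference side: an oracle velocity for a single target chunk replaces
# the learned network, purely to show the sampling loop.
def oracle_velocity(x, t):
    return (actions - x) / (1.0 - t)

sampled = euler_sample(oracle_velocity, noise, num_steps=10)
```

In a trained model, the network's predicted velocity would be regressed onto `target`, and `euler_sample` would call that network (with the conditioning tokens) instead of the oracle.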