World-Model-Based Control for Industrial box-packing of Multiple Objects using NewtonianVAE

Yusuke Kato; Ryo Okumura; Tadahiro Taniguchi


TL;DR

This paper tackles high-precision, sequential industrial box-packing of multiple objects by introducing the in-hand-view-sensitive Newtonian VAE (ihVS-NVAE), a world model that uses an RGB in-hand camera to observe the posture of the grasped object. The latent state $x_t$ is inferred from the hand-camera view and the latent variable $z$ from the in-hand-camera view; the goal state $x_g$ is then generated from $z$. A model trained on a single object-placement task can therefore handle sequential placements without additional training. On a real robot, the method achieves a 100% success rate in sequential box-packing, outperforming state-of-the-art and conventional approaches, and its training data can be collected by factory workers without annotation. The work advances practical automation in manufacturing by combining RGB in-hand sensing with cropped, local image observations that look similar at every packing stage, which supports repeated, sequential tasks in real-world settings.
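As a rough illustration of the control idea summarized above (and described in the Figure 1 caption), the sketch below runs proportional control toward a goal in a latent space. It is a minimal toy, not the authors' implementation: the names `p_control_step` and `run_to_goal`, the gain value, and the dynamics (the latent state moves by exactly the commanded action) are all assumptions made for illustration. In the real system, $x_t$ and $x_g$ would come from the learned encoders rather than being given directly.

```python
import numpy as np


def p_control_step(x_t, x_g, gain=0.5):
    """Proportional action: a step proportional to the latent-space error."""
    return gain * (x_g - x_t)


def run_to_goal(x0, x_g, gain=0.5, tol=1e-3, max_steps=100):
    """Iterate proportional control until the latent error is below tol.

    Assumption (toy dynamics): the latent state changes by exactly the
    commanded action at each step. A real robot would instead re-encode
    a new hand-camera image after every motion.
    """
    x = np.asarray(x0, dtype=float)
    x_g = np.asarray(x_g, dtype=float)
    for step in range(max_steps):
        if np.linalg.norm(x_g - x) < tol:
            return x, step
        x = x + p_control_step(x, x_g, gain)
    return x, max_steps


if __name__ == "__main__":
    # Hypothetical 2-D latent start and goal, chosen only for the demo.
    final_x, n_steps = run_to_goal([0.0, 0.0], [1.0, -2.0])
    print(final_x, n_steps)
```

Because the action depends only on the difference between $x_g$ and $x_t$, the same controller can be reused at every packing stage as long as the encoders produce consistent latents, which is the property the paper relies on.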

Abstract

The process of industrial box-packing, which involves the accurate placement of multiple objects, requires high-accuracy positioning and sequential actions. When a robot is tasked with placing an object at a specific location with high accuracy, it is important to have information not only about the location of the object to be placed, but also about the posture of the object grasped by the robotic hand. Industrial box-packing often requires the sequential placement of identically shaped objects into a single box, and the robot's actions throughout should be determined by the same learned model. In factories, new kinds of products appear frequently, so there is a need for a model that can easily adapt to them; it should therefore be easy to collect data to train the model. In this study, we designed a robotic system to automate real-world industrial tasks, employing a vision-based learning control model. We propose the in-hand-view-sensitive Newtonian variational autoencoder (ihVS-NVAE), which employs an RGB camera to obtain in-hand postures of objects. We demonstrate that our model, trained for a single object-placement task, can handle sequential tasks without additional training. To evaluate the efficacy of the proposed model, we employed a real robot to perform sequential industrial box-packing of multiple objects. The proposed model achieved a 100% success rate in industrial box-packing tasks, outperforming state-of-the-art and conventional approaches and underscoring its effectiveness and potential in industrial tasks.


Paper Structure (19 sections, 8 equations, 8 figures, 1 table, 1 algorithm)

INTRODUCTION
PRELIMINARY CONCEPTS
NewtonianVAE
Goal State Prediction
In-hand-View-Sensitive Newtonian VAE
Proximity Camera for In-hand Object Posture
Utilization of Latent Space Characteristics
Sequential Industrial Box-packing of Multiple objects
Experiment
Hardware Configuration
Data Collection
Training
EVALUATION
Visualization of the Latent Space
Positioning Performance
...and 4 more sections

Figures (8)

Figure 1: (a) Overview of our robotic system used in industrial box-packing. The robot is equipped with a vacuum gripper to pick up objects. Two RGB cameras are attached to the robotic hand. The "hand camera" captures images of the inside of the box, and the "in-hand camera" captures images of the posture of the picked object. Latent states $x_t$ and $z$ are inferred from hand-camera and in-hand-camera images, respectively. An insertion position $x_g$ is generated from $z$. The vacuum gripper is moved to the placing position by proportional control in the latent space, even when the pose of the vacuum-held object varies. (b) Industrial box-packing of multiple objects is our target task in this paper. This task is common in real manufacturing plants. The initial box-packing state is shown in the left image. The robotic system developed to perform the box-packing task of multiple objects is shown in the right image.
Figure 2: In the VAE encoder, if the input images have a similar appearance, the estimated latent variables will also take similar values. Thus, the captured images were cropped such that the model could treat them as the same state, even if they were actually in different states.
Figure 3: The world model concept can be applied to perform industrial box-packing tasks. During the data collection phase, hand and in-hand images are collected at stage $n$. After training, the model executes the box-packing task at stage $n$. The model also executes the task at stage $n+1$ because the environment is observed locally, and the observed images are similar at each stage.
Figure 4: Objects used in the box-packing experiment. (a) LED package with size 50 mm $\times$ 100 mm $\times$ 50 mm. (b) Cable package with size 80 mm $\times$ 210 mm $\times$ 50 mm. Each box can store up to five objects. Compared to Cable packages, LED packages are smaller and have a wider margin within the box.
Figure 5: Data collection process. Initially, the robot picks up an object by vacuum suction in a random pose. Thereafter, it lifts the object upward. At this instant, $\mathbf{I}_z$ and $\mathbf{I}_0$ are observed. $\mathbf{I}_0$ is treated as $\mathbf{I}_g$ during the training of the placing-position estimator $p(\mathbf{x}_g\vert\mathbf{z})$. Subsequently, the robot moves randomly above a box to collect the transition data at each time step $t$ (time horizon $H = 20$, number of episodes: 60).
...and 3 more figures
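
The data collection procedure described in the Figure 5 caption can be sketched as a simple episode loop. This is only an illustrative layout under stated assumptions: the horizon of 20 and the 60 episodes come from the caption, while the function names, the dictionary keys, the 2-D action, the 8 x 8 placeholder images, and the uniform random motion policy are hypothetical stand-ins for the real cameras and robot.

```python
import numpy as np

H = 20           # time horizon per episode (from the Figure 5 caption)
N_EPISODES = 60  # number of episodes (from the Figure 5 caption)


def fake_image(rng, shape=(8, 8, 3)):
    """Placeholder for a camera frame; the shape is an assumption."""
    return rng.random(shape)


def collect_episode(rng, horizon=H, action_dim=2):
    """Record one episode: in-hand image, goal image, and random transitions."""
    in_hand = fake_image(rng)  # I_z: in-hand view right after lifting the object
    goal = fake_image(rng)     # I_0: hand view at the same instant, reused as I_g
    frames, actions = [], []
    for _ in range(horizon):
        u = rng.uniform(-1.0, 1.0, size=action_dim)  # assumed random motion command
        actions.append(u)
        frames.append(fake_image(rng))  # I_t: hand view observed after the motion
    return {
        "I_z": in_hand,
        "I_g": goal,
        "I_t": np.stack(frames),
        "u_t": np.stack(actions),
    }


def collect_dataset(seed=0, n_episodes=N_EPISODES):
    """Collect all episodes with a seeded generator for reproducibility."""
    rng = np.random.default_rng(seed)
    return [collect_episode(rng) for _ in range(n_episodes)]


if __name__ == "__main__":
    dataset = collect_dataset()
    print(len(dataset), dataset[0]["I_t"].shape, dataset[0]["u_t"].shape)
```

No labels are attached to any frame in this layout, which mirrors the annotation-free collection the paper emphasizes: the goal image is simply the first observation of each episode.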
