Synthesizing multi-log grasp poses in cluttered environments

Arvid Fälldin, Tommy Löfstedt, Tobias Semberg, Erik Wallin, Martin Servin

TL;DR

This work tackles multi-object grasp synthesis in cluttered environments by using synthetic data from physics-based simulation to train a U-Net that predicts per-pixel grasp maps conditioned on target logs. By encoding grasp pose information as $(x,y,\phi,w,q)$ and extending grasp quality with a flexible objective function, the approach can prioritize both graspability and the number and balance of grasped logs. The method performs well in simulation, including robustness to obstacles and packed-pile scenarios, and shows potential for real-system transfer with domain adaptation. Practically, this enables more energy- and cost-efficient automated handling of log piles by forwarders, while highlighting the need for domain randomization and adaptive control for real-world deployment.

Abstract

Multi-object grasping is a challenging task. It is important for energy- and cost-efficient operation of industrial crane manipulators, such as those used to collect tree logs from the forest floor and on forest machines. In this work, we used synthetic data from physics simulations to explore how data-driven modeling can be used to infer multi-object grasp poses from images. We showed that convolutional neural networks can be trained specifically for synthesizing multi-object grasps. Using RGB-Depth images and instance segmentation masks as input, a U-Net model outputs grasp maps with the corresponding grapple orientation and opening width. Given an observation of a pile of logs, the model can be used to synthesize and rate the possible grasp poses and select the most suitable one, with the possibility to respect changing operational constraints such as lift capacity and reach. When tested on previously unseen data, the proposed model found successful grasp poses with an accuracy of up to 96%.

Paper Structure (26 sections, 5 equations, 14 figures, 3 tables, 1 algorithm)

This paper contains 26 sections, 5 equations, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: Left: A pile of logs jammed in between a rock and a tree stump. Top right: A forwarder grasping logs in a cluttered environment. Bottom right: Log piles at a clearcut area.
  • Figure 2: Model overview. 1) A sample is drawn from the pile database. 2) A rule-based algorithm is used to generate grasp candidates. 3) The grasp candidates are tested in simulation. 4) The pile is annotated with the successful grasps. 5) Grasp annotations are converted into target arrays. 6) Each log in the pile is segmented individually. 7) Individual masks are combined into target masks. With four logs, there are $2^4-1=15$ possible target subsets to consider. 8) An RGB-Depth image and a target mask are used as input. 9--10) A U-Net model is trained to predict the target variables from Step 5. Steps 2--5 and 8--10 are repeated for each of the 15 target masks. 11) The model output is used to compute the predicted grasp quality in each pixel. We search over all pixels in each of the 15 $Q$ maps and pick the grasp that maximizes the quality. 12) The chosen grasp is tested in simulation.
  • Figure 3: Definition of the grapple width, $w$, grapple orientation, $\phi$, and balance angle, $\beta$. The simplified collision geometries of the grapple and tree logs are drawn with orange lines. Cyan arcs show the rotator's unactuated joints. Examples of obstacle geometries are shown on the right.
  • Figure 4: Example of a simulated grasp. The grapple is lowered over the target logs (white) using a grasp pose (position, orientation, opening width) that avoids collision with the surrounding logs, stumps and rocks (colored).
  • Figure 5: Illustration of how annotated grasps are encoded into 2D arrays. Top left: The thick yellow lines show the position of the grapple claw tips, and the red areas show the rectangle that we used to encode the grasp parameters in the target arrays. The bottom row shows an example output from the neural network.
  • ...and 9 more figures
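
The Figure 2 caption describes combining individual log masks into target masks (Step 7, with $2^4-1=15$ subsets for four logs) and searching all pixels of every $Q$ map for the best grasp (Step 11). The following is a minimal sketch of those two steps only, not the authors' implementation. The function names (`target_subsets`, `combine_masks`, `best_grasp`) and the nested-list representation of masks and quality maps are assumptions made for illustration.

```python
from itertools import combinations


def target_subsets(log_ids):
    """Yield every non-empty subset of log_ids (2**n - 1 subsets in total)."""
    for size in range(1, len(log_ids) + 1):
        for subset in combinations(log_ids, size):
            yield subset


def combine_masks(masks, subset):
    """Union the per-log binary masks (nested lists of 0/1) for the logs in subset.

    `masks` is indexed by log id; this layout is a hypothetical choice.
    """
    rows = len(masks[subset[0]])
    cols = len(masks[subset[0]][0])
    return [
        [int(any(masks[i][r][c] for i in subset)) for c in range(cols)]
        for r in range(rows)
    ]


def best_grasp(q_maps):
    """Return (quality, subset, row, col) for the best pixel over all Q maps.

    `q_maps` maps each target subset to a 2D list of predicted qualities.
    """
    best = None
    for subset, q_map in q_maps.items():
        for r, row in enumerate(q_map):
            for c, value in enumerate(row):
                if best is None or value > best[0]:
                    best = (value, subset, r, c)
    return best


# Four logs give 15 candidate target subsets, matching the caption.
subsets = list(target_subsets([0, 1, 2, 3]))
print(len(subsets))  # 15
```

In practice the quality maps would come from the trained U-Net, and array operations would replace the explicit loops. The exhaustive enumeration grows as $2^n-1$, so it is only practical for small piles.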