Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning

Lun Li, Hamidreza Kasaei

TL;DR

This work tackles occlusion in robotic harvesting with an end-to-end imitation-learning viewpoint planner that continuously adjusts a camera in 6-DoF to reveal occluded crops. The approach uses Action Chunking with Transformer (ACT) to predict action chunks from RGB-D images and camera pose changes, and is trained via behavior cloning on expert demonstrations collected in Gazebo. In simulation, the planner achieves an 86.7% success rate with a planning time of 3.1 s and generalizes across eight fruit types. In real-world tests the success rate is 66.7%, with the drop attributed to clutter and lighting variations; the result still indicates practical potential. Overall, the study provides a data-efficient, generalizable learning-from-demonstration (LfD) solution for occlusion-aware view planning, with real-time closed-loop control at 10 Hz, that enhances autonomous harvesting performance and productivity.
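The chunked, closed-loop control described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_policy` stands in for the learned network (which would consume RGB-D images and pose changes rather than the pose itself), and `CHUNK_SIZE`, `apply_delta`, `run_chunked_control`, and the visibility test are all hypothetical names and values chosen for the example. Only the 10 Hz rate comes from the text.

```python
from typing import Callable, List, Sequence, Tuple

Pose = List[float]        # [x, y, z, roll, pitch, yaw]
Delta = Sequence[float]   # per-step 6-DoF pose change

CONTROL_HZ = 10   # closed-loop rate quoted in the text
CHUNK_SIZE = 5    # hypothetical chunk length; not given in this section


def apply_delta(pose: Pose, delta: Delta) -> Pose:
    """Add a small pose change component-wise (a simplification: a real
    controller would compose rotations instead of adding Euler angles)."""
    return [p + d for p, d in zip(pose, delta)]


def run_chunked_control(
    policy: Callable[[Pose], List[Delta]],
    pose: Pose,
    target_visible: Callable[[Pose], bool],
    max_steps: int = 100,
) -> Tuple[Pose, int]:
    """Ask the policy for a chunk, execute it step by step, and re-plan
    until the target is visible or the step budget runs out."""
    steps = 0
    while steps < max_steps and not target_visible(pose):
        chunk = policy(pose)             # one chunk = several future actions
        if not chunk:                    # nothing to execute: stop early
            break
        for delta in chunk:
            pose = apply_delta(pose, delta)
            steps += 1                   # each step lasts 1 / CONTROL_HZ seconds
            if steps >= max_steps or target_visible(pose):
                break
    return pose, steps


def toy_policy(pose: Pose) -> List[Delta]:
    """Stand-in for the learned network: always slide 6.25 cm along x."""
    return [[0.0625, 0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(CHUNK_SIZE)]


if __name__ == "__main__":
    final_pose, n = run_chunked_control(
        toy_policy, [0.0] * 6, lambda p: p[0] >= 0.5
    )
    print(f"x={final_pose[0]} after {n} steps ({n / CONTROL_HZ} s at {CONTROL_HZ} Hz)")
```

Predicting several actions per query means the network is called less often than the control loop ticks, while the early exit inside the chunk keeps the loop responsive once the target becomes visible.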

Abstract

In agricultural automation, inherent occlusion presents a major challenge for robotic harvesting. We propose a novel imitation learning-based viewpoint planning approach that actively adjusts the camera viewpoint to capture unobstructed images of the target crop. Traditional viewpoint planners and existing learning-based methods depend on manually designed evaluation metrics or reward functions and often struggle to generalize to complex, unseen scenarios. Our method employs the Action Chunking with Transformer (ACT) algorithm to learn effective camera motion policies from expert demonstrations. This enables continuous six-degree-of-freedom (6-DoF) viewpoint adjustments that are smoother and more precise, and that reveal occluded targets. Extensive experiments in both simulated and real-world environments, featuring agricultural scenarios and a 6-DoF robot arm equipped with an RGB-D camera, demonstrate our method's superior success rate and efficiency, especially in complex occlusion conditions, as well as its ability to generalize across different crops without reprogramming. This study advances robotic harvesting by providing a practical "learn from demonstration" (LfD) solution to occlusion challenges, ultimately enhancing autonomous harvesting performance and productivity.
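For context, ACT as originally formulated trains a conditional VAE with a reconstruction term plus a weighted KL regularizer. The block below restates that generic objective as a reference point only. The notation ($k$ for chunk length, $o_t$ for the observation at step $t$, $\beta$ for the KL weight, $q_\phi$ for the training-time encoder) is assumed here, and this page does not confirm that the paper uses exactly this loss.

```latex
\mathcal{L}
  = \underbrace{\bigl\lVert \hat{a}_{t:t+k} - a_{t:t+k} \bigr\rVert_{1}}_{\text{reconstruction}}
  + \beta \, D_{\mathrm{KL}}\!\bigl( q_\phi(z \mid a_{t:t+k},\, o_t) \,\big\Vert\, \mathcal{N}(0, I) \bigr)
```

Under this objective the latent $z$ is pulled toward a standard normal prior, so fixing $z$ to the prior mean (the zero vector) at inference gives a deterministic prediction.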

Paper Structure

This paper contains 12 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: We propose an end-to-end imitation learning-based viewpoint planning method to address the challenge of identifying ideal observation viewpoints for target crops in occluded robotic harvesting environments. Unlike existing approaches that rely on hand-engineered features, our model learns viewpoint planning policies directly from human expert demonstrations. The robot adjusts its camera pose in a continuous 6-DoF action space, enabling precise viewpoint adjustments. This leads to better visibility of occluded crops, improving fruit detection and overall harvesting efficiency across various agricultural environments.
  • Figure 2: We collected expert demonstration data in the Gazebo simulator as follows. We uniformly generated 50 initial viewpoints at a certain density around the target crop (pepper). Starting from each initial viewpoint, an expert observed the camera's current image through the ROS RViz interface and used a custom-designed control panel to issue the corresponding continuous 6-DoF motion commands. During this process, we recorded the sequence of images captured by the camera, the camera's pose trajectory, and the smoothed continuous action commands. Together, these components form our expert demonstration dataset.
  • Figure 3: Our network takes both RGB-D images and camera pose changes as input and outputs refined 6-DoF camera movements. It consists of two components: a CVAE encoder and a CVAE decoder. The CVAE encoder includes a transformer encoder that operates only during training, processing the input and generating the style variable $z$. During inference, the CVAE encoder is discarded and $z$ is set to a zero vector. The CVAE decoder consists of a transformer encoder and a transformer decoder, which together generate the action chunk.
  • Figure 4: We built the training and experimental simulation environment using ROS and Gazebo. It incorporates a UR5e robotic arm equipped with an RGB-D camera at the end-effector, alongside a pepper plant and eight different types of fruit.
  • Figure 5: Our model significantly improved banana detection confidence from 0.30 in occluded conditions to 0.62 after adjusting to a non-occluded viewpoint. For oranges, which were undetected in the occluded scenario, the confidence score increased to 0.87 following viewpoint adjustment.
  • ...and 1 more figure
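The Figure 2 caption mentions 50 initial viewpoints generated uniformly around the target crop but does not say how they were placed. One plausible way to do this, shown purely as an assumption, is a Fibonacci lattice on a sphere centred on the crop. The function names `sample_viewpoints` and `look_at_direction`, the radius, and the crop position below are hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def sample_viewpoints(center: Vec3, radius: float, n: int = 50) -> List[Vec3]:
    """Spread n camera positions roughly evenly over a sphere around
    `center` using a Fibonacci lattice (an assumed sampling scheme)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n     # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)        # ring radius at that height
        theta = golden_angle * i          # rotate by the golden angle each step
        points.append((
            center[0] + radius * r * math.cos(theta),
            center[1] + radius * r * math.sin(theta),
            center[2] + radius * z,
        ))
    return points


def look_at_direction(viewpoint: Vec3, center: Vec3) -> Vec3:
    """Unit vector pointing from the camera position toward the target."""
    d = [c - v for v, c in zip(viewpoint, center)]
    norm = math.sqrt(sum(x * x for x in d))
    return (d[0] / norm, d[1] / norm, d[2] / norm)


if __name__ == "__main__":
    crop = (0.0, 0.0, 0.5)                # hypothetical crop position in metres
    views = sample_viewpoints(crop, radius=0.4)
    print(len(views), views[0], look_at_direction(views[0], crop))
```

A full sphere includes poses below the plant that an arm may not reach, so a practical setup would likely filter the samples by reachability or restrict them to a hemisphere.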