A Multi-Modal Interaction Framework for Efficient Human-Robot Collaborative Shelf Picking

Abhinav Pathak, Kalaichelvi Venkatesan, Tarek Taha, Rajkumar Muthusamy

TL;DR

The paper tackles the problem of safe, intuitive human-robot collaboration for shelf picking in warehouses. It presents a multi-modal framework that fuses perception, a physics-based Box Relations Graph with a PyBullet simulation, and an LLM-augmented decision-maker using Chain-of-Thought to plan safe extraction sequences under human guidance. Key contributions include a BRG for dependency-based safe removal, a real-to-sim pipeline for accurate physics grounding, a multimodal interaction interface, and real-world validation across gesture-guided extraction, collaborative shelf clearing, and stability assistance. The results indicate improved collaboration efficiency and safety compared to robot-only approaches, with real-time interaction and transparent LLM reasoning; the approach lays groundwork for scalable HRC in cluttered, unstructured environments.

Abstract

The growing presence of service robots in human-centric environments, such as warehouses, demands seamless and intuitive human-robot collaboration. In this paper, we propose a collaborative shelf-picking framework that combines multimodal interaction, physics-based reasoning, and task division for enhanced human-robot teamwork. The framework enables the robot to recognize human pointing gestures, interpret verbal cues and voice commands, and communicate through visual and auditory feedback. It is powered by a Large Language Model (LLM) that uses Chain of Thought (CoT) reasoning, a physics-based simulation engine for safely retrieving boxes from cluttered stacks on shelves, and a relationship graph for sub-task generation, extraction sequence planning, and decision making. We validate the framework through real-world shelf-picking experiments, including 1) Gesture-Guided Box Extraction, 2) Collaborative Shelf Clearing, and 3) Collaborative Stability Assistance.
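
To make the idea of dependency-based extraction planning more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the paper derives its relationship graph from a physics simulation, whereas this example assumes the support relations are already known and stores them in a hypothetical dictionary `supports`, where `supports[a]` is the set of boxes resting directly on box `a`. The function name `extraction_order` is likewise an assumption made for illustration. It returns one ordering in which every box resting (directly or indirectly) on the target is removed before the box beneath it.

```python
from collections import deque


def extraction_order(supports, target):
    """Return boxes to remove, in order, ending with `target`.

    `supports[a]` is the set of boxes resting directly on box `a`
    (a hypothetical encoding, not the paper's data structure).
    """
    # Collect every box that directly or indirectly rests on the target.
    blockers = set()
    queue = deque([target])
    while queue:
        box = queue.popleft()
        for upper in supports.get(box, ()):
            if upper not in blockers:
                blockers.add(upper)
                queue.append(upper)
    involved = blockers | {target}

    # load[b]: number of boxes still resting on b.
    load = {b: len(set(supports.get(b, ()))) for b in involved}
    # supporters[u]: boxes that u rests on (restricted to involved boxes).
    supporters = {b: [] for b in involved}
    for lower in involved:
        for upper in set(supports.get(lower, ())):
            supporters[upper].append(lower)

    # Repeatedly remove a box that has nothing resting on it.
    free = deque(sorted(b for b in involved if load[b] == 0))
    order = []
    while free:
        box = free.popleft()
        order.append(box)
        for lower in supporters[box]:
            load[lower] -= 1
            if load[lower] == 0:
                free.append(lower)

    if len(order) != len(involved):
        raise ValueError("support relations contain a cycle")
    return order


if __name__ == "__main__":
    # B and C rest on A; D rests on B; E and F are an unrelated stack.
    stack = {"A": {"B", "C"}, "B": {"D"}, "E": {"F"}}
    print(extraction_order(stack, "A"))
```

A real system would also need to check whether removing a box destabilizes its neighbors, which is the kind of question the physics simulation is meant to answer; this sketch only handles the "what rests on what" ordering.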

Paper Structure

This paper contains 13 sections, 9 figures, 2 tables, 2 algorithms.

Figures (9)

  • Figure 1: Collaborative Shelf Picking: This illustration depicts a mobile manipulator robot, powered by an LLM and a novel physics-based reasoning engine, collaborating with a human in real time
  • Figure 2: Overview of the proposed grasping pipeline using the physics-aware approach for safe cardboard box extraction
  • Figure 3: Real-to-Sim Pipeline
  • Figure 4: Illustration showing how a Box Relations Graph is computed from a simulation
  • Figure 5: Pointing Recognition Pipeline
  • ...and 4 more figures