LLM-Pack: Intuitive Grocery Handling for Logistics Applications

Yannik Blei, Michael Krawez, Tobias Jülg, Pierre Krack, Florian Walter, Wolfram Burgard

TL;DR

LLM-Pack presents an open-vocabulary grocery packing framework that combines vision-language perception, planning with large language models, and GroundedSAM-based execution to preserve product integrity. The method introduces the Packing Consistency Score $C$ to quantify human-like packing sequences and provides a Grocery Packing Dataset for evaluation, demonstrating strong planning performance with GPT-4.5 and robust end-to-end results on a Franka robot. The key contributions are that no training is required for new grocery items, a modular design that allows model upgrades, and a publicly released dataset and code to enable reproducibility and future work. The results indicate practical potential for service robotics in grocery settings, with future directions including enhanced execution with vision-language-action models (VLAs) and improved handling of free space within packing containers.

Abstract

Robotics and automation are increasingly influential in logistics but remain largely confined to traditional warehouses. In grocery retail, advancements such as cashier-less supermarkets exist, yet customers still manually pick and pack groceries. While there has been a substantial focus in robotics on the bin picking problem, the task of packing objects and groceries has remained largely untouched. Yet packing grocery items in the right order is crucial for preventing product damage, e.g., heavy objects should not be placed on top of fragile ones. The exact criteria for the right packing order are hard to define, particularly given the huge variety of objects typically found in stores. In this paper, we introduce LLM-Pack, a novel approach for grocery packing. LLM-Pack leverages language and vision foundation models for identifying groceries and generating a packing sequence that mimics human packing strategies. LLM-Pack does not require dedicated training to handle new grocery items, and its modularity allows easy upgrades of the underlying foundation models. We extensively evaluate our approach to demonstrate its performance. We will make the source code of LLM-Pack publicly available upon the publication of this manuscript.
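
The modularity described in the abstract can be illustrated with a minimal sketch, assuming that perception and planning are plain callables that can be swapped independently. All names here (`PackingPipeline`, `fake_vlm`, `fake_llm`) are hypothetical placeholders and are not taken from the LLM-Pack code. The stubs return fixed data, and the toy fragile-last rule only stands in for an LLM's answer.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; a real system would call foundation models here.
Perceive = Callable[[str], List[str]]    # image path -> unsorted item names
Plan = Callable[[List[str]], List[str]]  # unsorted names -> packing order


@dataclass
class PackingPipeline:
    """Perception and planning as swappable callables (illustrative only)."""
    perceive: Perceive
    plan: Plan

    def run(self, image_path: str) -> List[str]:
        items = self.perceive(image_path)
        order = self.plan(items)
        # A plan must pack each detected item exactly once.
        if sorted(order) != sorted(items):
            raise ValueError("plan is not a permutation of the detected items")
        return order


def fake_vlm(image_path: str) -> List[str]:
    # Stand-in for a vision-language model; ignores the image entirely.
    return ["eggs", "canned beans", "bread"]


def fake_llm(items: List[str]) -> List[str]:
    # Stand-in for an LLM planner: a toy rule that packs fragile items last.
    fragile = {"eggs", "bread"}
    return sorted(items, key=lambda name: name in fragile)


pipeline = PackingPipeline(perceive=fake_vlm, plan=fake_llm)
print(pipeline.run("scene.png"))
```

Because each stage is only a callable, replacing `fake_llm` with a wrapper around a different model leaves the rest of the pipeline unchanged. The permutation check guards against a planner that drops or invents items.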

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Packing of groceries while ensuring product integrity. An image of the scene is recorded. The image is then processed by foundation models to find a suitable packing sequence for the objects in the scene.
  • Figure 2: Overview of the LLM-Pack approach: We assume that the groceries are provided in a random but separated arrangement on the table. In the perception step, we employ a Vision Language Model (VLM) to obtain an unsorted list of the objects in the scene. In the planning step, we then utilize a Large Language Model (LLM) to determine a proper packing sequence of the objects. In the final execution step, we perform the required manipulation tasks using GroundedSAM (Ren et al., 2024).
  • Figure 3: Example images from the "Grocery Packing Dataset". Scenes contain between 6 and 20 groceries. The dataset consists of 520 groceries in 40 scenes. We release the dataset as a part of this work.
  • Figure 4: Object recognition performance in terms of the Average F1 Score (AF1) over the scene size (number of grocery items). GPT-4.5 achieves the best performance for most scene sizes.
  • Figure 5: Average packing consistency score (AC) over the scene size (number of grocery items) for different Large Language Models (LLMs). The packing consistency score C describes the quality of the proposed grocery packing sequence according to human-annotated ground truth. Models show increasing packing quality with growing parameter count. OpenAI's GPT-4.5, GPT-4o, and o3-mini models show superior performance.
  • ...and 1 more figure
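
Figure 4 reports object recognition as an F1 score averaged per scene size. As a rough sketch of how such a score could be computed for a single scene, the snippet below compares predicted and ground-truth items as multisets of exactly matching names. This matching rule is an assumption, since the paper's actual matching procedure is not given in this summary, and `scene_f1` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def scene_f1(predicted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """F1 between two item lists, treated as multisets of exact names."""
    pred, truth = Counter(predicted), Counter(ground_truth)
    true_pos = sum((pred & truth).values())  # size of the multiset intersection
    if true_pos == 0:
        return 0.0  # also covers empty predictions or empty ground truth
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(truth.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example scene: 3 of 4 predictions are correct, 3 of 5 items are found.
predicted = ["milk", "eggs", "bread", "soda"]
ground_truth = ["milk", "eggs", "bread", "apples", "yogurt"]
print(round(scene_f1(predicted, ground_truth), 3))  # prints 0.667
```

Averaging `scene_f1` over all scenes with the same number of items would give one point per scene size, which is the kind of curve the caption describes.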