Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao

TL;DR

Box6D addresses 6D pose estimation for warehouse boxes under clutter and occlusion with a zero-shot, category-level approach. It rescales a category CAD template using a per-axis binary search for dimension estimation, applies a depth-consistency filter to resolve symmetry ambiguities, and uses an early-stopping mechanism to cut computation while preserving accuracy. The method achieves competitive or superior pose precision on warehouse data and public benchmarks (HouseCat6D and PACE) and reduces inference time by approximately 76%, enabling real-time robotic manipulation. Overall, Box6D demonstrates scalable, low-maintenance category-level pose estimation tailored to warehouse environments, bridging the gap between model-based accuracy and model-free flexibility.

Abstract

Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain. Model-based methods assume an exact CAD model at inference, but they require high-resolution meshes and transfer poorly to new environments. Model-free methods, which rely on a few reference images or videos, are more flexible but often fail under challenging conditions. Category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and an early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
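
The per-axis binary search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the projected mask extent along an axis grows monotonically with the template dimension along that axis, and it replaces the render-and-measure step with a hypothetical pinhole stand-in, `pinhole_extent`. The function names, search bounds, pixel tolerance, focal length, and depth are all illustrative assumptions, and the pose refresh between rescaling steps is omitted.

```python
from typing import Callable, List, Sequence


def binary_search_dims(
    observed_extents: Sequence[float],
    project_extent: Callable[[int, float], float],
    lo: float = 0.05,
    hi: float = 2.0,
    tol_px: float = 1.0,
    max_iters: int = 30,
) -> List[float]:
    """Estimate one box dimension (metres) per axis by bisection.

    project_extent(axis, dim) is a hypothetical stand-in for rendering the
    rescaled template and measuring its mask extent in pixels along `axis`.
    It is assumed to increase monotonically with `dim`.
    """
    dims = []
    for axis, target_px in enumerate(observed_extents):
        a, b = lo, hi
        mid = 0.5 * (a + b)
        for _ in range(max_iters):
            mid = 0.5 * (a + b)
            diff = project_extent(axis, mid) - target_px
            if abs(diff) <= tol_px:
                break  # projected and observed extents agree within tolerance
            if diff > 0:
                b = mid  # template too large along this axis
            else:
                a = mid  # template too small along this axis
        dims.append(mid)
    return dims


def pinhole_extent(axis: int, dim: float,
                   focal_px: float = 600.0, depth_m: float = 1.5) -> float:
    """Toy projection (assumption): a fronto-parallel edge at a fixed depth."""
    return focal_px * dim / depth_m


# Synthetic "observation" generated from known dimensions of 0.40, 0.30, 0.25 m.
true_dims = [0.40, 0.30, 0.25]
observed = [pinhole_extent(i, d) for i, d in enumerate(true_dims)]
estimated = binary_search_dims(observed, pinhole_extent)
```

In this toy setup a one-pixel tolerance corresponds to a dimension error of at most 2.5 mm. In a real pipeline, each call to `project_extent` would involve rendering the template at the current pose estimate, which is why fewer iterations translate directly into lower inference time.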

Paper Structure

This paper contains 19 sections, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: A typical storage scenario in which the robot is tasked with picking up a box. The green line highlights the target box to be manipulated. Before grasping, the robot must accurately predict the pose and scale of the observed box.
  • Figure 2: Overview of the Box6D pipeline. The observed RGB–D image and a rendered RGB–D category box template are first used for detection and segmentation. The detections are passed to the pose estimator to generate multiple pose hypotheses, reject those with depth inconsistency, and refine the high-confidence candidates. The resulting pose guides dimension estimation by comparing the projected CAD mask to the observation; the CAD is rescaled accordingly and fed back to pose estimation. This project–compare–rescale loop repeats until convergence or terminates early via the early-stopping module.
  • Figure 3: An example of how the depth-consistency filter improves 6D pose precision. (Top left): A rejected pose hypothesis produces inconsistent depth, protruding from the stack of boxes. (Top right): Centers of all pose hypotheses, shown as red points; those with inconsistent depth lie outside the box stack. (Bottom left): Centers of the remaining pose hypotheses after the depth-consistency filter, shown as red points; the centers now lie within the box stack. (Bottom right): Correct 6D pose estimated after applying the depth-consistency filter. The estimated pose is marked in red and the ground truth in green.
  • Figure 4: An example of scaling the template box to align with the target box. (Top): The template box (green) is mismatched in scale with the target box (black). We compare the observed mask of the target and the projected mask of the template along all axes using their pixel extents, indicated by red, green, and blue lines. (Bottom): After iterative comparison, rescaling, and pose refresh, convergence is reached: the mask extents along all axes fall within a specified threshold, and the scale is correctly estimated.
  • Figure 5: An example of estimating the target box scale using a closed-form update when the early-stopping criterion is satisfied. (Left): The template box (green) and target box (black) poses are aligned at a vertex, and the only remaining difference is scale. The scale mismatch is indicated by red lines. (Right): The same scale difference is visualized on the observed and projected masks of the target and template boxes with red lines. The ratio of these red-line extents is used to solve for the scale.
  • ...and 3 more figures
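
The ideas in Figures 3 and 5 can be sketched in a few lines. The excerpt above does not give the exact criteria, so both functions below are assumptions rather than the paper's method: `filter_by_depth` keeps a hypothesis when the surface depth it predicts lies within a tolerance of the median observed depth inside the mask, and `closed_form_scale` assumes that, once template and target are aligned at a vertex, mask extent is proportional to the box dimension, so a single ratio replaces further search iterations. All names, inputs, and thresholds are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def filter_by_depth(
    predicted_depths_m: Sequence[float],
    observed_depths_m: Sequence[float],
    tol_m: float = 0.05,
) -> List[int]:
    """Return indices of hypotheses whose predicted surface depth is plausible.

    Assumption: a hypothesis is plausible if its predicted depth is within
    `tol_m` of the median observed depth inside the object mask.
    """
    reference = median(observed_depths_m)
    return [
        i for i, depth in enumerate(predicted_depths_m)
        if abs(depth - reference) <= tol_m
    ]


def closed_form_scale(
    current_dim_m: float,
    observed_extent_px: float,
    projected_extent_px: float,
) -> float:
    """Rescale one dimension by the ratio of observed to projected extent.

    Assumption: with poses aligned at a vertex, extent is proportional to
    dimension, so the ratio solves for the scale in a single step.
    """
    if projected_extent_px <= 0:
        raise ValueError("projected extent must be positive")
    return current_dim_m * observed_extent_px / projected_extent_px


# Four hypotheses; the second and fourth predict depths far from the observation.
kept = filter_by_depth(
    predicted_depths_m=[1.48, 1.10, 1.52, 2.05],
    observed_depths_m=[1.49, 1.50, 1.51, 1.50, 1.52],
)

# The template edge projects to 120 px while the observed edge spans 160 px.
new_dim = closed_form_scale(0.30, observed_extent_px=160.0, projected_extent_px=120.0)
```

Filtering before refinement means that implausible hypotheses are never refined, and the one-step ratio update avoids extra project, compare, and rescale rounds once alignment is good enough.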