THOR2: Topological Analysis for 3D Shape and Color-Based Human-Inspired Object Recognition in Unseen Environments
Ekta U. Samani, Ashis G. Banerjee
TL;DR
THOR2 tackles robust RGB-D object recognition in unseen, cluttered environments by extending the topology-based TOPS descriptor to TOPS2, which adds color embeddings computed from coarse color regions obtained via the Mapper algorithm. By combining persistence-based 3D shape information with topologically informed color embeddings, THOR2 outperforms its shape-only predecessor THOR and deep learning baselines on the OCID and UW-IS Occluded benchmarks using synthetic training data alone, and it is resilient to the sim2real gap. The framework uses a streamlined two-classifier setup (one for TOPS and one for TOPS2), a view-normalization step, and occlusion handling at test time, enabling strong performance without real-world training data. A LoCoBot deployment confirms practical applicability: THOR2 processes a frame in about 0.7 s, and its interpretable topological representations support robust object recognition on low-cost robotic systems.
Abstract
Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. This study presents a 3D shape and color-based descriptor, TOPS2, for point clouds generated from RGB-D images and an accompanying recognition framework, THOR2. The TOPS2 descriptor embodies object unity, a human cognition mechanism, by retaining the slicing-based topological representation of 3D shape from the TOPS descriptor while capturing object color information through slicing-based color embeddings computed using a network of coarse color regions. These color regions, analogous to the MacAdam ellipses identified in human color perception, are obtained using the Mapper algorithm, a topological soft-clustering technique. THOR2, trained using synthetic data, demonstrates markedly improved recognition accuracy compared to THOR, its 3D shape-based predecessor, on two benchmark real-world datasets: the OCID dataset, which captures cluttered scenes from different viewpoints, and the UW-IS Occluded dataset, which reflects different environmental conditions and degrees of object occlusion and was recorded using commodity hardware. On both datasets, THOR2 also outperforms baseline deep learning networks and a widely used Vision Transformer (ViT) adapted for RGB-D inputs, trained using synthetic and limited real-world data. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots.
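To make the idea of Mapper as a soft-clustering technique concrete, the following is a minimal, generic sketch of the algorithm on a toy set of 2D "color" points. It is not the THOR2 implementation: the lens (first coordinate), the cover parameters `n_intervals` and `overlap`, the distance threshold `eps`, the single-linkage clustering, and the function names `single_linkage_clusters` and `mapper_graph` are all assumptions chosen for illustration. The sketch covers the lens range with overlapping intervals, clusters the points falling in each interval, and links clusters that share points.

```python
import numpy as np
from itertools import combinations


def single_linkage_clusters(points, eps):
    """Label points by connected component of the eps-neighborhood graph."""
    n = len(points)
    labels = -np.ones(n, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already assigned to a component
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            dists = np.linalg.norm(points - points[i], axis=1)
            # Unlabeled points within eps of point i join the component.
            for j in np.flatnonzero((dists <= eps) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels


def mapper_graph(points, lens, n_intervals=4, overlap=0.25, eps=0.15):
    """Return (nodes, edges) of a Mapper graph.

    nodes: list of frozensets of point indices (one per local cluster).
    edges: pairs of node indices whose clusters share at least one point.
    All parameter defaults are illustrative assumptions.
    """
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / n_intervals
    pad = overlap * length  # each interval is widened by pad on both sides
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        idx = np.flatnonzero((lens >= a) & (lens <= b))
        if idx.size == 0:
            continue
        labels = single_linkage_clusters(points[idx], eps)
        for c in np.unique(labels):
            nodes.append(frozenset(idx[labels == c].tolist()))
    edges = [
        (i, j)
        for i, j in combinations(range(len(nodes)), 2)
        if nodes[i] & nodes[j]
    ]
    return nodes, edges


# Toy data: a regular 11 x 11 grid of hypothetical 2D color coordinates.
grid = np.linspace(0.0, 1.0, 11)
colors = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
nodes, edges = mapper_graph(colors, lens=colors[:, 0])
```

Because neighboring intervals overlap, a point near an interval boundary belongs to more than one node, which is what makes the clustering soft; the resulting nodes and edges form a network of coarse regions in the sense the abstract describes.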
