THOR2: Topological Analysis for 3D Shape and Color-Based Human-Inspired Object Recognition in Unseen Environments

Ekta U. Samani, Ashis G. Banerjee

TL;DR

THOR2 tackles robust RGB-D object recognition in unseen clutter by extending the topology-based TOPS descriptor into TOPS2, which adds color embeddings computed from a network of coarse color regions obtained with the Mapper algorithm. By combining persistence-based 3D shape information with topologically informed color embeddings, THOR2 achieves state-of-the-art results on the OCID and UW-IS Occluded benchmarks using synthetic training data alone, and demonstrates resilience to the sim2real gap. The framework uses a streamlined two-classifier setup (one for TOPS and one for TOPS2), a view-normalization step, and occlusion handling during testing, enabling strong performance with less real-world training data. A LoCoBot deployment confirms practical applicability: THOR2 processes a frame in about 0.7 s, and its interpretable topological representations support robust object recognition on low-cost robotic systems.

Abstract

Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. This study presents a 3D shape and color-based descriptor, TOPS2, for point clouds generated from RGB-D images and an accompanying recognition framework, THOR2. The TOPS2 descriptor embodies object unity, a human cognition mechanism, by retaining the slicing-based topological representation of 3D shape from the TOPS descriptor while capturing object color information through slicing-based color embeddings computed using a network of coarse color regions. These color regions, analogous to the MacAdam ellipses identified in human color perception, are obtained using the Mapper algorithm, a topological soft-clustering technique. THOR2, trained using synthetic data, demonstrates markedly improved recognition accuracy compared to THOR, its 3D shape-based predecessor, on two benchmark real-world datasets: the OCID dataset, which captures cluttered scenes from different viewpoints, and the UW-IS Occluded dataset, which reflects different environmental conditions and degrees of object occlusion recorded using commodity hardware. THOR2 also outperforms baseline deep learning networks and a widely used Vision Transformer (ViT) adapted for RGB-D inputs, trained using synthetic and limited real-world data, on both datasets. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots.

Paper Structure (32 sections, 6 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: An illustration of the Mapper algorithm applied to a sample two-dimensional point cloud. In this example, the height function is used as the lens function, $f_l$, to build the cover in a one-dimensional space. The dotted boxes indicate clusters in the corresponding refined pullback cover of the point cloud. The clusters are collapsed into vertices, and intersections between them are represented by edges in the output graph. Source: modified from chazal2021introduction.
  • Figure 2: Generation of a network of coarse color regions using the Mapper algorithm. A chroma and hue-based lens function projects colors from the CIELAB space to a two-dimensional space. Clustering is performed in each of the overlapping intervals of the cover to obtain a refined pullback cover whose nerve represents the color network.
  • Figure 3: Proposed framework, THOR2, for 3D shape and color-based recognition using object unity, facilitated by the similarity in the TOPS and TOPS2 descriptors of unoccluded and occluded objects. In this illustration, the persistence images and color embeddings for the first three slices of the mustard bottle (denoted using the purple, blue, and green boxes) are similar across the training and test stages; only the persistence image and color embedding corresponding to the last slice of the mustard bottle, which is affected by occlusion, are different.
  • Figure 4: Computation of the TOPS2 descriptor from the slices of an aligned object point cloud, $\hat{\mathcal{P}}$. The similarity matrix, $\Delta$, representing the color network obtained using the Mapper algorithm, is used to compute color embeddings for every slice of the point cloud. The resulting color embeddings are vectorized (shown using patterned rectangles) and interleaved with the vectorized persistence images (solid rectangles) to obtain the TOPS2 descriptor.
  • Figure 5: Sample results from the OCID dataset sequences recorded using the lower camera (green and red boxes represent correct and incorrect recognition, respectively). The three rows show results on sequences with curved, cuboidal, and mixed objects placed on a table.
  • ...and 4 more figures
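
The Mapper steps described in the Figure 1 caption (lens function, overlapping cover, clustering within each pulled-back interval, and a nerve graph with one vertex per cluster) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `mapper_graph`, the parameters `n_intervals`, `overlap`, and `eps`, and the choice of clustering (connected components of a fixed-radius neighborhood graph) are all hypothetical. The sample point cloud is a circle, with height as the lens as in the caption.

```python
import numpy as np


def mapper_graph(points, lens, n_intervals=3, overlap=0.3, eps=0.3):
    """Toy Mapper: return (clusters, edges) for a point cloud.

    Assumptions (not from the paper): a 1D lens, equal-length intervals that
    overlap by the fraction `overlap`, and clustering by connected components
    of the graph linking points closer than `eps`.
    """
    values = lens(points)
    lo, hi = values.min(), values.max()
    # Interval length such that n_intervals overlapping intervals span [lo, hi].
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)

    clusters = []  # each cluster is a set of point indices (one graph vertex)
    for i in range(n_intervals):
        start = lo + i * step
        inside = np.flatnonzero((values >= start) & (values <= start + length + 1e-9))
        # Refine the pullback cover: split the preimage into connected components.
        unvisited = set(inside.tolist())
        while unvisited:
            seed = unvisited.pop()
            component, frontier = {seed}, [seed]
            while frontier:
                j = frontier.pop()
                near = [k for k in unvisited
                        if np.linalg.norm(points[j] - points[k]) <= eps]
                for k in near:
                    unvisited.remove(k)
                    component.add(k)
                    frontier.append(k)
            clusters.append(component)

    # Nerve: connect two clusters whenever they share at least one point.
    edges = {(a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))
             if clusters[a] & clusters[b]}
    return clusters, edges


# Sample input: 60 evenly spaced points on the unit circle, height as the lens.
angles = 2.0 * np.pi * np.arange(60) / 60
circle = np.column_stack([np.cos(angles), np.sin(angles)])
clusters, edges = mapper_graph(circle, lens=lambda p: p[:, 1])
```

With these settings the bottom and top intervals each yield one cluster and the middle interval yields two (the left and right arcs), so the output graph is a four-vertex cycle that reflects the loop in the data. For the color network of Figure 2, the lens would instead map CIELAB colors to a two-dimensional chroma and hue space, which requires a two-dimensional cover rather than the one-dimensional one used here.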