RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval

Khanh Nguyen; Dasith de Silva Edirimuni; Ghulam Mubashar Hassan; Ajmal Mian

TL;DR

RI-Mamba addresses the challenge of rotation-invariant text-to-shape retrieval by introducing a rotation-invariant state-space architecture for point clouds. It combines local/global reference frames, Hilbert-based patch serialization, linear-time orientational embeddings, FiLM-based feature modulation, and a bidirectional Mamba backbone, trained with cross-modal contrastive learning to scale to 200+ categories without manual annotations. The approach achieves state-of-the-art or competitive results across supervised and zero-shot text-to-shape and 3D-to-3D tasks under arbitrary orientations, while remaining computationally efficient compared with RI-transformers. This work enables practical, scalable, rotation-robust retrieval in large 3D repositories, with significant implications for real-world search and scene assembly in AR/VR pipelines.

Abstract

3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at https://github.com/ndkhanh360/RI-Mamba.git.
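
The idea of using a reference frame to separate pose from geometry can be illustrated with a small NumPy sketch. It uses a PCA-derived global frame with a third-moment sign rule, which is a common stand-in and not necessarily the reference frame computation used in RI-Mamba; the function names `canonical_coordinates` and `random_rotation` are hypothetical. The sketch assumes the covariance eigenvalues are distinct and the cloud is skewed along each axis, so the frame is well defined.

```python
import numpy as np

def canonical_coordinates(points):
    """Express a point cloud in a PCA-derived frame (illustrative stand-in, not the paper's RFC)."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    # Columns of `axes` are eigenvectors of the covariance, in ascending eigenvalue order.
    _, axes = np.linalg.eigh(cov)
    coords = centered @ axes
    # Eigenvectors are only defined up to sign; flip each axis so that the
    # third moment of the projected coordinates is non-negative.
    signs = np.sign((coords ** 3).sum(axis=0))
    signs[signs == 0] = 1.0
    return coords * signs

def random_rotation(rng):
    """Draw a proper rotation matrix (determinant +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
# Anisotropic, skewed synthetic cloud so the principal axes and signs are unambiguous.
cloud = rng.exponential(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
rotated = cloud @ random_rotation(rng).T

# Coordinates in the canonical frame should agree up to floating-point error.
max_diff = np.abs(canonical_coordinates(cloud) - canonical_coordinates(rotated)).max()
print(max_diff)
```

Because the frame rotates together with the input, the coordinates expressed in it do not change under rotation. Any network that consumes only those coordinates inherits the invariance, at the cost of discarding the absolute orientation.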

Paper Structure

This paper contains 30 sections, 8 equations, 8 figures, and 5 tables.

Introduction
Related Works
Proposed Method
Reference Frame Computation (RFC)
Rotation-Invariant Point Serialization
Rotation-Invariant Patch Embeddings
Linear-time Orientational Embedding
RI-Mamba Blocks
Feature-wise Linear Modulation (FiLM)
Reverse Operator for Bidirectional Scanning
Cross-Modal Contrastive Learning
Experiments
Experimental Setup
Pretraining Data for Zero-Shot Experiments
Supervised Text-to-Shape Retrieval
...and 15 more sections

Figures (8)

Figure 1: Existing text-to-3D retrieval methods require human annotations and pre-aligned shapes, limiting them to a narrow set of object classes and restricting them to canonical poses. Our proposed RI-Mamba is designed to be rotation invariant and is trained via cross-modal contrastive learning on diverse 3D assets without manual annotations to enable retrieval across a wide range of object categories under arbitrary orientations.
Figure 2: Overview of our method. Given a point cloud, we form local patches using FPS and kNN. RI-Serialization and RFC align patches and define a RI token order. RI geometric, positional, and orientational embeddings are extracted and fed into RI-Mamba blocks, which model long-range relationships via FiLM, reverse operators, and Mamba modules (green). The final 3D feature is aligned with CLIP's image and text embeddings via cross-modal contrastive learning (yellow).
Figure 3: Retrieval results on Text2Shape dataset. Under random rotation, RI-Mamba maintains robust performance while SCA3D fails to retrieve the correct chair (green box) and includes irrelevant tables (red boxes) in top 3 candidates.
Figure 4: RI-Mamba and RI-Transformer efficiency comparisons.
Figure 5: Comparison on axis swap robustness and alternative RI approaches.
...and 3 more figures
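
The Figure 2 caption mentions FiLM inside the RI-Mamba blocks. As a rough sketch of the general technique, feature-wise linear modulation scales and shifts each feature channel using parameters predicted from a conditioning vector. The code below assumes per-token orientational embeddings act as the condition and that the scale and shift come from single linear layers; the function `film`, the weight names, and all dimensions are hypothetical, and the paper's actual layer layout is not described in this excerpt.

```python
import numpy as np

def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(condition) * features + beta(condition)."""
    gamma = condition @ w_gamma + b_gamma  # per-token, per-channel scale
    beta = condition @ w_beta + b_beta     # per-token, per-channel shift
    return gamma * features + beta

rng = np.random.default_rng(0)
num_tokens, feat_dim, cond_dim = 4, 8, 6              # illustrative sizes only
tokens = rng.normal(size=(num_tokens, feat_dim))      # stand-in for patch features
orient = rng.normal(size=(num_tokens, cond_dim))      # stand-in for orientational embeddings

# With zero weights, unit scale bias, and zero shift bias, FiLM is the identity map.
w_zero = np.zeros((cond_dim, feat_dim))
identity_out = film(tokens, orient, w_zero, np.ones(feat_dim), w_zero, np.zeros(feat_dim))

# With non-zero weights, the condition changes the features channel by channel.
w_gamma = rng.normal(scale=0.1, size=(cond_dim, feat_dim))
w_beta = rng.normal(scale=0.1, size=(cond_dim, feat_dim))
modulated = film(tokens, orient, w_gamma, np.ones(feat_dim), w_beta, np.zeros(feat_dim))
print(modulated.shape)
```

The modulation is applied independently to each token, so its cost grows linearly with sequence length, which fits the linear-time claim made for the orientational embedding strategy.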
