Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval

Wenrui Li, Yidan Lu, Yeyu Chai, Rui Zhao, Hengyu Man, Xiaopeng Fan

TL;DR

H$^{2}$ARN tackles text-3D retrieval by embedding text and 3D data in a Lorentz-model hyperbolic space to preserve hierarchical structure and reduce redundancy. It introduces a hierarchical ordering loss via entailment cones and a contribution-aware hyperbolic aggregation that weights local features by their semantic relevance, trained with a Lorentzian contrastive objective. The approach achieves state-of-the-art performance on the original and expanded T3DR-HIT datasets, validating its robustness and scalability across diverse indoor-scene and artifact categories. By explicitly modeling hierarchy in hyperbolic space and focusing attention on discriminative regions, the work advances cross-modal retrieval and provides a resource (T3DR-HIT v2) to accelerate further research.

Abstract

With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model's ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H$^{2}$ARN) for text-3D retrieval. H$^{2}$ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our code is available at https://github.com/liwrui/H2ARN.
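To make the Lorentz-model terminology concrete, the following is a minimal pure-Python sketch of the standard hyperboloid operations: lifting a Euclidean feature onto the hyperboloid with the exponential map at the origin, the Lorentzian inner product, and the geodesic (Lorentzian) distance. It is not taken from the H$^{2}$ARN codebase; the function names, the `(time, space)` tuple layout, and the curvature parameter `c` are illustrative assumptions.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v (a tangent vector at the origin) onto the
    hyperboloid of curvature -c. Returns a (time, space) pair.
    Hypothetical helper for illustration only."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        # The zero vector maps to the hyperboloid origin (1/sqrt(c), 0, ..., 0).
        return 1.0 / math.sqrt(c), [0.0] * len(v)
    sc = math.sqrt(c)
    scale = math.sinh(sc * norm) / (sc * norm)
    space = [scale * x for x in v]
    # The time coordinate is fixed by the constraint -t^2 + |space|^2 = -1/c.
    time = math.sqrt(1.0 / c + sum(s * s for s in space))
    return time, space

def lorentz_inner(a, b):
    """Lorentzian inner product <a, b>_L = <a_space, b_space> - a_time * b_time."""
    (ta, sa), (tb, sb) = a, b
    return sum(x * y for x, y in zip(sa, sb)) - ta * tb

def lorentz_distance(a, b, c=1.0):
    """Geodesic distance on the hyperboloid: acosh(-c <a, b>_L) / sqrt(c)."""
    arg = max(1.0, -c * lorentz_inner(a, b))  # clamp against rounding error
    return math.acosh(arg) / math.sqrt(c)

origin = exp_map_origin([0.0, 0.0])
point = exp_map_origin([0.3, 0.4])
dist_from_origin = lorentz_distance(origin, point)
```

In this sketch the distance from the origin to the lifted point equals the Euclidean norm of the input vector (0.5 here), which matches the reading in which distance from the origin encodes semantic specificity.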

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Conceptual illustration of hierarchical data representation. Left: The exponentially growing tree structures inherent in both abstract-to-specific semantics and whole-to-part geometry. Right: Comparison of embedding spaces. Euclidean space suffers from a "crowding" effect, whereas hyperbolic space naturally preserves the hierarchy. In hyperbolic space, the origin represents the most general concepts, with distance from the origin encoding semantic specificity.
  • Figure 2: An overview of the H$^{2}$ARN architecture. The Structural Context Encoder first refines local features from each modality in Euclidean space to produce context-aware representations. Subsequently, the Hyperbolic Hierarchical Alignment Module aligns the features in hyperbolic space via a contribution-aware aggregation mechanism and a dual geometric loss, preserving their semantic hierarchy.
  • Figure 3: Geometric illustration of the Hierarchical Ordering Loss. The loss enforces the "text entails 3D" partial order by penalizing a 3D embedding $h_p$ only if it lies outside the entailment cone defined by its corresponding text embedding $h_t$. The penalty is proportional to the difference between the exterior angle $\theta(h_t, h_p)$ and the cone's half-aperture $\phi(h_t)$.
  • Figure 4: Qualitative comparison of text-to-3D retrieval results on the T3DR-HIT v2 dataset. For each query, the top-5 retrieved point clouds are shown, ranked from left to right by matching score. Green boxes indicate correct matches, while red boxes indicate incorrect ones.
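
The Hierarchical Ordering Loss described in the Figure 3 caption can be sketched as a hinge on the gap between the exterior angle $\theta(h_t, h_p)$ and the half-aperture $\phi(h_t)$. The paper's own equations are not reproduced on this page, so the aperture formula (including the constant `k = 0.1`) and the exterior-angle expression below are assumptions borrowed from a commonly used Lorentz-model entailment-cone construction; all function names are hypothetical.

```python
import math

def lift(space, c=1.0):
    """Attach the time coordinate so (time, space) lies on the hyperboloid
    of curvature -c. Hypothetical helper for illustration."""
    return math.sqrt(1.0 / c + sum(s * s for s in space)), list(space)

def half_aperture(x, c=1.0, k=0.1):
    """Assumed cone half-aperture phi(x) = asin(2k / (sqrt(c) * |x_space|)).
    It shrinks as x moves away from the origin (more specific concepts)."""
    _, xs = x
    norm = math.sqrt(sum(s * s for s in xs))
    ratio = 2.0 * k / (math.sqrt(c) * norm + 1e-12)
    return math.asin(min(1.0, ratio))

def exterior_angle(x, y, c=1.0):
    """Assumed exterior angle theta(x, y) at x between the ray leaving the
    origin through x and the geodesic from x to y."""
    (xt, xs), (yt, ys) = x, y
    inner = c * (sum(a * b for a, b in zip(xs, ys)) - xt * yt)  # c * <x, y>_L
    x_norm = math.sqrt(sum(s * s for s in xs))
    denom = x_norm * math.sqrt(max(inner * inner - 1.0, 1e-12))
    cos_val = (yt + xt * inner) / (denom + 1e-12)
    return math.acos(max(-1.0, min(1.0, cos_val)))

def ordering_loss(h_t, h_p, c=1.0):
    """Penalize h_p only when it lies outside the cone of h_t:
    max(0, theta(h_t, h_p) - phi(h_t))."""
    return max(0.0, exterior_angle(h_t, h_p, c) - half_aperture(h_t, c))

h_t = lift([0.5, 0.0])        # text embedding, closer to the origin
h_p_in = lift([2.0, 0.0])     # 3D embedding further out along the same ray
h_p_out = lift([-2.0, 0.0])   # 3D embedding on the opposite side of the origin
loss_inside = ordering_loss(h_t, h_p_in)
loss_outside = ordering_loss(h_t, h_p_out)
```

Under these assumptions, a 3D embedding lying further out along the same ray as its text embedding incurs zero loss, while one on the opposite side of the origin is penalized by roughly $\pi - \phi(h_t)$.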