TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

Bryce Grant, Aryeh Rothenberg, Atri Banerjee, Peng Wang

TL;DR

This work presents TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. It introduces Geometry-Aware Semantic Attention (GASA), which uses predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches.

Abstract

Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from $O(N)$ clicks to a single text query. The model processes each frame at $1008 \times 1008$ resolution in $\sim$57 ms ($\sim$18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications. Code and checkpoints are available at https://cwru-aism.github.io/triangulang/.
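
To make the gating idea concrete, here is a minimal sketch of attention across views in which a geometric term suppresses matches that look right semantically but lie far apart in 3D. It is not the paper's GASA formulation, which this section does not give. The function name `gated_cross_view_attention`, the Gaussian distance penalty, and the `sigma` parameter are all assumptions chosen for illustration.

```python
import math

def gated_cross_view_attention(query_feat, query_xyz, key_feats, key_xyzs,
                               values, sigma=0.1):
    """Attend from one query token to key tokens taken from other views.

    Hypothetical sketch: each logit is a scaled dot product (semantic
    similarity) minus a Gaussian-style penalty on the squared distance
    between the predicted 3D points of the query and the key.
    """
    dim = len(query_feat)
    logits = []
    for key_feat, key_xyz in zip(key_feats, key_xyzs):
        semantic = sum(q * k for q, k in zip(query_feat, key_feat)) / math.sqrt(dim)
        dist_sq = sum((a - b) ** 2 for a, b in zip(query_xyz, key_xyz))
        # Assumed gate: keys far from the query in 3D get a lower logit.
        logits.append(semantic - dist_sq / (2.0 * sigma ** 2))

    # Numerically stable softmax over the gated logits.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Two keys with identical features (equally plausible semantically);
# only the first lies near the query's predicted 3D point.
weights, out = gated_cross_view_attention(
    query_feat=[1.0, 0.0],
    query_xyz=[0.0, 0.0, 1.0],
    key_feats=[[1.0, 0.0], [1.0, 0.0]],
    key_xyzs=[[0.01, 0.0, 1.0], [0.8, 0.0, 1.2]],
    values=[[1.0], [0.0]],
    sigma=0.1,
)
print(weights)
```

In this toy setup almost all of the attention weight goes to the geometrically consistent key. With a very large `sigma` the gate has no effect and the two keys receive near-equal weight, which is what happens when views are matched on semantics alone.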

Paper Structure

This paper contains 86 sections, 20 equations, 7 figures, and 14 tables.

Figures (7)

  • Figure 1: Overview of the TrianguLang architecture.
  • Figure 2: Overview of the GASA decoder.
  • Figure 3: Performance on uCO3D and ScanNet++ datasets. Left to right: RGB, depth map, ground truth, TrianguLang masks.
  • Figure 4: Qualitative comparison on LERF-OVS scenes using uniform clip thresholds. Row 1: "toaster" query: LERF and LangSplatV2 produce diffuse activations across the scene while TrianguLang tightly focuses its relevancy map on the target object. Row 3: "stripes" query: TrianguLang achieves precise localization despite not training on this dataset, and runs 3 orders of magnitude faster ($\sim$58ms vs. 10 to 45 min).
  • Figure 5: Spatial disambiguation on the NVOS T-Rex scene. Top: The query "dino" (97.6% IoU) segments the dominant triceratops skull in the scene. Bottom: The query "leftmost dino" (95.8% IoU) leverages spatial reasoning to disambiguate between the two skulls, correctly selecting only the left specimen. The depth map (second column) provides the geometric context that enables this: TrianguLang computes 3D centroids for each candidate mask and selects the one satisfying the spatial constraint ($\arg\min_i x_i$ for "leftmost"), resolving an ambiguity that object names alone cannot.
  • ...and 2 more figures
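
The spatial selection described in the Figure 5 caption (compute a 3D centroid per candidate mask, then take $\arg\min_i x_i$ for "leftmost") can be sketched as below. This is an illustrative reading of the caption, not the released implementation. The helper names `mask_centroid_3d` and `select_leftmost`, the nested-list data layout, and the convention that the $x$ axis points to the right of the camera are assumptions.

```python
def mask_centroid_3d(mask, points):
    """Mean 3D point over the pixels where `mask` is True.

    mask:   H x W nested list of booleans (one candidate segmentation).
    points: H x W nested list of (x, y, z) tuples, e.g. back-projected
            from a predicted depth map (assumed input format).
    """
    selected = [p
                for mask_row, point_row in zip(mask, points)
                for inside, p in zip(mask_row, point_row)
                if inside]
    if not selected:
        raise ValueError("mask selects no pixels")
    count = len(selected)
    return tuple(sum(p[axis] for p in selected) / count for axis in range(3))

def select_leftmost(masks, points):
    """Index of the mask whose 3D centroid has the smallest x coordinate.

    Assumes x grows to the right, so the smallest x is the leftmost object.
    """
    centroids = [mask_centroid_3d(mask, points) for mask in masks]
    return min(range(len(masks)), key=lambda i: centroids[i][0])

# Toy 2 x 4 point map: x runs from -1.5 (left) to 1.5 (right), depth is 2.0.
points = [[(col - 1.5, float(row), 2.0) for col in range(4)] for row in range(2)]
right_mask = [[False, False, True, True]] * 2
left_mask = [[True, True, False, False]] * 2

# Candidates are listed right-then-left, so the leftmost one has index 1.
print(select_leftmost([right_mask, left_mask], points))
```

Other spatial words would swap the reduction under the same assumptions, for example the largest $x$ for "rightmost" or the smallest $z$ for "nearest".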