GLASS: Graph and Vision-Language Assisted Semantic Shape Correspondence

Qinfeng Xiao, Guofeng Mei, Qilong Liu, Chenyuan Yi, Fabio Poiesi, Jian Zhang, Bo Yang, Yick Kit-lun

TL;DR

GLASS introduces three key innovations: a view-consistent strategy that enables robust multi-view visual feature extraction from powerful vision foundation models; the injection of language embeddings into vertex descriptors via zero-shot 3D segmentation; and a graph-assisted contrastive loss that enforces structural consistency between regions. Together, these allow GLASS to learn globally coherent and semantically consistent maps without ground-truth supervision.

Abstract

Establishing dense correspondence across 3D shapes is crucial for fundamental downstream tasks, including texture transfer, shape interpolation, and robotic manipulation. However, learning these mappings without manual supervision remains a formidable challenge, particularly under severe non-isometric deformations and in inter-class settings where geometric cues are ambiguous. Conventional functional map methods, while elegant, typically struggle in these regimes due to their reliance on isometry. To address this, we present GLASS, a framework that bridges the gap by integrating geometric spectral analysis with rich semantic priors from vision-language foundation models. GLASS introduces three key innovations: (i) a view-consistent strategy that enables robust multi-view visual feature extraction from powerful vision foundation models; (ii) the injection of language embeddings into vertex descriptors via zero-shot 3D segmentation, capturing high-level part semantics; and (iii) a graph-assisted contrastive loss that enforces structural consistency between regions (e.g., source's "head" $\leftrightarrow$ target's "head") by leveraging geodesic and topological relationships between regions. This design allows GLASS to learn globally coherent and semantically consistent maps without ground-truth supervision. Extensive experiments demonstrate that GLASS achieves state-of-the-art performance across all regimes, maintaining high accuracy on standard near-isometric tasks while significantly advancing performance in challenging settings. Specifically, it achieves average geodesic errors of 0.21, 4.5, and 5.6 on the inter-class benchmark SNIS and non-isometric benchmarks SMAL and TOPKIDS, reducing errors from URSSM baselines of 0.49, 6.0, and 8.9 by 57%, 25%, and 37%, respectively.
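The reported reductions can be checked from the error values quoted in the abstract. The short sketch below assumes "reduction" means the relative decrease (baseline − ours) / baseline; the helper name `relative_reduction` and the `results` dictionary are illustrative and not taken from the paper.

```python
# Recompute the relative error reductions quoted in the abstract.
# Assumption: reduction = (baseline - ours) / baseline, expressed in percent.

def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# (URSSM baseline error, GLASS error) as quoted in the abstract.
results = {
    "SNIS": (0.49, 0.21),
    "SMAL": (6.0, 4.5),
    "TOPKIDS": (8.9, 5.6),
}

reductions = {
    name: relative_reduction(baseline, ours)
    for name, (baseline, ours) in results.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower error")
```

Rounded to whole percentages, these values agree with the 57%, 25%, and 37% stated above.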

Paper Structure (15 sections, 10 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: GLASS achieves robust dense semantic correspondence across diverse 3D matching scenarios by unifying the functional map framework with vision-language priors and semantic graph alignment. Compared with the functional map baseline URSSM [cao2023unsupervised], our method resolves semantic ambiguities in: 1) Inter-class matching, where semantic part alignment (e.g., human arm to horse front leg) is required across distinct categories; 2) Non-isometric deformations, involving significant morphological variations across subjects; and 3) Topological noise and near-isometry, where GLASS maintains high precision despite severe topological artifacts.
  • Figure 2: The overall structure of GLASS. It consists of three key stages: (1) View-consistent feature lifting, where we synthesize coherent textures and lift SD-DINO features [zhang2023tale] onto the 3D surface; (2) Language-guided semantic injection, which enriches visual descriptors with linguistic priors using embeddings from a sentence encoder (e.g., SigLip [tschannen2025siglip]) computed for zero-shot region proposals; and (3) Region-aware map optimization, where we employ a functional map guided by a novel semantic graph-aware contrastive loss to ensure structural and semantic consistency.
  • Figure 3: Comparison of textured meshes and semantic features between Diff3F [dutt2024diffusion] and ours. By synthesizing highly realistic and view-coherent textures for 3D meshes, our view-consistent strategy (\ref{sec:sem_feat_extraction}) facilitates the extraction of high-fidelity, view-consistent SD-DINO features, thus boosting semantic matching. In contrast, Diff3F struggles with severe multi-view inconsistencies and visual artifacts during texturing, thereby degrading the quality of its resulting feature fields.
  • Figure 4: Example of a semantic region graph and region-level contrast. The graph represents high-level topological relationships between semantic regions, where nodes correspond to distinct parts and edges represent semantic relation priors. Guided by this structure, our graph-assisted contrastive loss pulls vertex features within the same region together while pushing distinct regions apart.
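
The pull/push behaviour described in the Figure 4 caption can be illustrated with a small region-level contrastive objective. This is a minimal sketch under stated assumptions, not the loss defined in the paper, whose exact formulation does not appear in this excerpt. It assumes an InfoNCE-style objective over mean region features ("prototypes"), with negatives weighted by a hypothetical hop distance in the region graph. The names `region_prototypes`, `region_contrastive_loss`, `hop_dist`, and `tau` are all illustrative.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def region_prototypes(features, labels):
    """Average the vertex features of each semantic region."""
    groups = {}
    for feat, label in zip(features, labels):
        groups.setdefault(label, []).append(feat)
    return {label: mean_vector(rows) for label, rows in groups.items()}

def region_contrastive_loss(src_feats, src_labels, tgt_feats, tgt_labels,
                            hop_dist, tau=0.1):
    """InfoNCE-style loss over region prototypes (illustrative assumption).

    Positive pair: the same region label on source and target.
    Negatives: every other target region, weighted by hop_dist[r][q],
    a hypothetical graph distance between regions r and q.
    """
    src = region_prototypes(src_feats, src_labels)
    tgt = region_prototypes(tgt_feats, tgt_labels)
    shared = sorted(set(src) & set(tgt))
    if not shared:
        raise ValueError("source and target share no region labels")
    total = 0.0
    for r in shared:
        pos = math.exp(cosine(src[r], tgt[r]) / tau)
        neg = 0.0
        for q in tgt:
            if q == r:
                continue
            neg += hop_dist[r][q] * math.exp(cosine(src[r], tgt[q]) / tau)
        total += -math.log(pos / (pos + neg))
    return total / len(shared)

# Toy example: two regions with 2-D vertex features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["head", "head", "leg", "leg"]
swapped = ["leg", "leg", "head", "head"]
hops = {"head": {"leg": 1.0}, "leg": {"head": 1.0}}

aligned_loss = region_contrastive_loss(feats, labels, feats, labels, hops)
swapped_loss = region_contrastive_loss(feats, labels, feats, swapped, hops)
print(aligned_loss < swapped_loss)
```

In this toy setup the loss is small when same-named regions have similar features across the two shapes and large when the target labels are swapped, which is the behaviour the caption describes.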