Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

Qinfeng Xiao, Guofeng Mei, Bo Yang, Liying Zhang, Jian Zhang, Kit-lun Yick

TL;DR

UniMatch addresses universal 3D shape matching across inter-class and highly non-isometric shapes with a coarse-to-fine pipeline. The pipeline first builds semantic regions with PartField and names them through multimodal language guidance, then learns dense correspondences by extending the functional map framework with semantic feature fields and a group-wise Rank-n-Contrastive loss. The coarse stage relies on class-agnostic segmentation and FG-CLIP language embeddings, while the fine stage fuses geometric features with semantic features from SD-DINO and optimizes a data-regularized, semantically consistent map $\nabla f_\text{out} \rightarrow C_{yx}$. Empirically, UniMatch achieves state-of-the-art or competitive results on inter-class, non-isometric, and near-isometric benchmarks, demonstrates semantic co-segmentation, and remains robust on in-the-wild object categories, all without predefined part priors. Combining coarse language-guided semantics with a learned functional-map refinement enables scalable, universal 3D shape matching with practical impact for graphics and robotics.
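For readers unfamiliar with the functional map framework mentioned above, the following is a minimal sketch of the standard regularized least-squares solve used in many functional-map pipelines. It is not taken from UniMatch: the function name `fit_functional_map`, the weight `lam`, and the mapping direction (spectral coefficients on a shape X to those on a shape Y) are assumptions made for illustration.

```python
import numpy as np

def fit_functional_map(A, B, evals_x, evals_y, lam=1e-3):
    """Solve min_C ||C A - B||_F^2 + lam * ||C diag(evals_x) - diag(evals_y) C||_F^2.

    A: (k_x, d) descriptor coefficients in the spectral basis of shape X.
    B: (k_y, d) descriptor coefficients in the spectral basis of shape Y.
    evals_x, evals_y: Laplace-Beltrami eigenvalues of each basis.
    Returns C with shape (k_y, k_x). All names here are hypothetical.
    """
    k_x, k_y = A.shape[0], B.shape[0]
    AAt = A @ A.T  # (k_x, k_x)
    BAt = B @ A.T  # (k_y, k_x)
    C = np.zeros((k_y, k_x))
    for i in range(k_y):
        # The commutativity term decouples per row:
        # sum_j (evals_y[i] - evals_x[j])^2 * C[i, j]^2
        D = np.diag((evals_y[i] - evals_x) ** 2)
        # Normal equations for row i: (A A^T + lam D) c_i = A b_i
        C[i] = np.linalg.solve(AAt + lam * D, BAt[i])
    return C

# Toy check: with no regularization, a known map is recovered from consistent data.
rng = np.random.default_rng(0)
A_demo = rng.standard_normal((5, 20))
C_demo = rng.standard_normal((4, 5))
C_fit = fit_functional_map(A_demo, C_demo @ A_demo, np.zeros(5), np.zeros(4), lam=0.0)
print(np.abs(C_fit - C_demo).max())
```

The data term ties the map to descriptor agreement, and the eigenvalue term penalizes maps that do not commute with the Laplacians; how UniMatch injects semantic feature fields into this solve is not specified in this summary.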

Abstract

Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, yet prior approaches depend on near-isometric assumptions and homogeneous subject types (e.g., they operate only on human shapes). Building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To address this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the "fine" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guidance, and rank-based contrastive learning, our method handles universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate that UniMatch consistently outperforms competing methods in various challenging scenarios.
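One plausible way to picture the coarse stage described in the abstract is to match parts by the cosine similarity of their text embeddings. The sketch below assumes the embeddings have already been extracted; the toy vectors stand in for real VLM text embeddings, and `match_parts` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def match_parts(emb_x, emb_y):
    """Match each part of shape X to its most similar part of shape Y.

    emb_x: (p_x, d) text embeddings of part names on shape X.
    emb_y: (p_y, d) text embeddings of part names on shape Y.
    Returns (best_index_per_x_part, cosine_similarity_matrix).
    """
    x = emb_x / np.linalg.norm(emb_x, axis=1, keepdims=True)
    y = emb_y / np.linalg.norm(emb_y, axis=1, keepdims=True)
    sim = x @ y.T  # (p_x, p_y) cosine similarities
    return sim.argmax(axis=1), sim

# Toy embeddings (made up for illustration only).
parts_x = ["leg", "head"]
parts_y = ["skull", "limb", "tail"]
emb_x = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
emb_y = np.array([[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [0.0, 0.1, 1.0]])

best, sim = match_parts(emb_x, emb_y)
for name, j in zip(parts_x, best):
    print(name, "->", parts_y[j])
```

The full similarity matrix is returned alongside the hard assignment because graded similarities between parts are what a rank-based objective would need as ordinal supervision.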

Paper Structure

This paper contains 28 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We propose UniMatch, a semantic-aware, coarse-to-fine framework for 3D shape matching. Our method yields high-quality semantic correspondences under challenging scenarios, e.g., cross-category shapes and non-isometric deformations. Besides, UniMatch learns semantically consistent cross-shape features, enabling efficient co-segmentation for distinct categories.
  • Figure 2: Our framework consists of two stages: (i) Coarse stage: class-agnostic 3D segmentation (PartField) produces non-overlapping semantic regions. Multi-view rendering and multimodal large language model (MLLM) prompting (using GPT-5 and FG-CLIP) assign part names and map them to unified language embeddings, enabling implicit, robust coarse correspondences across shapes. (ii) Fine stage: the coarse semantic matches guide dense correspondence learning within an extended functional map pipeline, leveraging SD-DINO semantic feature fields and a novel group-wise rank-based contrastive loss to enforce semantic consistency. UniMatch operates without predefined part priors and generalizes across object categories and non-isometric deformations.
  • Figure 3: Intuition of rank-based contrastive loss. Compared with standard contrastive loss, rank-based contrastive loss effectively utilizes the ordinal hints to optimize features.
  • Figure 4: Illustration of our group-wise Rank-n-Contrastive Loss. For each anchor feature, we traverse the semantic regions as reference features, and group the negative samples by their semantic similarities (lower than reference). The darker the color, the closer the semantic similarity.
  • Figure 5: Visual comparison with the state-of-the-art method DenseMatcher (Zhu et al., 2025) on inter-class shape matching. UniMatch demonstrates semantic consistency and smooth correspondences for challenging cross-category matching.
  • ...and 5 more figures
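
The rank-based contrastive idea in the captions of Figures 3 and 4 can be illustrated with a simplified, point-wise sketch. For each anchor and each reference sample, the denominator contains only samples that are semantically no closer to the anchor than the reference, so the ordering of semantic similarities shapes the features. This is an assumption-laden illustration: the paper's loss is group-wise over semantic regions, and the function name, the temperature `tau`, and the toy data are hypothetical.

```python
import numpy as np

def rank_contrastive_loss(feats, sem_sim, tau=0.1):
    """Simplified point-wise rank-based contrastive loss.

    feats:   (n, d) learned features.
    sem_sim: (n, n) semantic similarity between samples (higher = closer).
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    logits = (f @ f.T) / tau
    n = f.shape[0]
    total, count = 0.0, 0
    for i in range(n):          # anchor
        for j in range(n):      # reference
            if i == j:
                continue
            # Samples no closer (semantically) to the anchor than the reference.
            mask = sem_sim[i] <= sem_sim[i, j]
            mask[i] = False     # never contrast the anchor with itself
            sel = logits[i, mask]
            m = sel.max()
            log_sum_exp = m + np.log(np.exp(sel - m).sum())
            total += log_sum_exp - logits[i, j]
            count += 1
    return total / count

# Toy data: samples 0,1 belong to one part, samples 2,3 to another.
sem_sim = np.array([[1.0, 1.0, 0.2, 0.2],
                    [1.0, 1.0, 0.2, 0.2],
                    [0.2, 0.2, 1.0, 1.0],
                    [0.2, 0.2, 1.0, 1.0]])
aligned = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
shuffled = aligned[[0, 2, 1, 3]]  # features no longer agree with the parts

print(rank_contrastive_loss(aligned, sem_sim))
print(rank_contrastive_loss(shuffled, sem_sim))
```

In this toy setup, features that respect the semantic ordering should receive a lower loss than features that contradict it, which is the behavior the captions attribute to exploiting ordinal hints.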