HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation

Zhiying Leng; Tolga Birdal; Xiaohui Liang; Federico Tombari

TL;DR

HyperSDFusion addresses the challenge of text-to-shape generation by modeling the inherent hierarchical structures of language and 3D geometry in hyperbolic space. It introduces a dual-branch latent diffusion framework that separately leverages sequential text features via a hyperbolic text-image encoder and hierarchical text features via a hyperbolic text-graph convolution module, with a hyperbolic hierarchical loss to preserve part-whole structure in generated shapes. The method achieves state-of-the-art results on Text2Shape, demonstrating improved fidelity (IoU, CD, FID, F-score) and stronger preservation of text-shape hierarchy, including robustness to longer prompts. This work advances practical 3D content generation by directly learning joint hierarchical representations, enabling more faithful and scalable text-conditioned shape synthesis.

Abstract

3D shape generation from text is a fundamental task in 3D representation learning. Text-shape pairs exhibit a hierarchical structure: a general prompt such as "chair" covers all 3D chair shapes, while more detailed prompts refer to more specific shapes. Furthermore, text and 3D shapes are each inherently hierarchical. However, existing text-to-shape methods, such as SDFusion, do not exploit these hierarchies. In this work, we propose HyperSDFusion, a dual-branch diffusion model that generates 3D shapes from a given text. Since hyperbolic space is well suited to hierarchical data, we propose to learn the hierarchical representations of text and 3D shapes in hyperbolic space. First, we introduce a hyperbolic text-image encoder to learn the sequential and multi-modal hierarchical features of text in hyperbolic space. In addition, we design a hyperbolic text-graph convolution module to learn the hierarchical features of text in hyperbolic space. To fully utilize these text features, we introduce a dual-branch structure that embeds them in the 3D feature space. Finally, to endow the generated 3D shapes with a hierarchical structure, we devise a hyperbolic hierarchical loss. Our method is the first to explore hyperbolic hierarchical representations for text-to-shape generation. Experiments on the existing text-to-shape paired dataset, Text2Shape, show that our method achieves state-of-the-art results. Our implementation is available at HyperSDFusion.github.io.
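To make the claim that hyperbolic space suits hierarchical data more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the Poincaré ball model with curvature -c (c = 1 by default), which the text above does not specify. The function names `expmap0` and `poincare_distance` are hypothetical. The sketch shows that the same Euclidean gap corresponds to a much larger geodesic distance near the boundary of the ball than near the origin, which is the property that leaves room for many fine-grained "leaf" items around a few general "root" items.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the Poincare ball of curvature -c."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Assumed toy points: both pairs are 0.1 apart in Euclidean terms.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])

# A Euclidean feature (e.g. an encoder output) lifted into the ball.
lifted = expmap0([0.3, 0.4])
```

Here `near_boundary` is larger than `near_origin` even though the Euclidean gap is the same. `expmap0` is one common way to lift Euclidean encoder features into hyperbolic space; whether HyperSDFusion uses this exact map is an assumption here.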

Paper Structure (43 sections, 5 equations, 12 figures, 6 tables, 2 algorithms)

Introduction
Related works
Text-to-Shape Generation.
Diffusion Models.
Hyperbolic Representation learning.
Preliminaries
Hyperbolic space.
Hyperbolic Graph Convolution (HGC)
Method
A Dual-branch Latent Diffusion Model
3D Shape Compression
Forward Process of The Latent Diffusion Model.
Reverse Process of The Latent Diffusion Model based on Text Conditions.
Dual-branch Denoiser.
Text Feature Learning in Hyperbolic Space
...and 28 more sections

Figures (12)

Figure 1: Hyperbolic text-shape representations. (a) The hierarchical structure between text and 3D shape. (b) The syntactic tree of text. (c) The hierarchical part-to-whole relationships of 3D shape.
Figure 2: Overview of the proposed HyperSDFusion. (a) The forward and reverse processes of the proposed dual-branch diffusion model from $Z_{0}$ to $Z_{T}$. In particular, the detailed denoising process of the latent feature $Z_{t}$ based on text conditions $\{C_{1},C_{2}\}$ is showcased. (b) The architecture of a VQVAE for 3D shape represented by SDF. (c) The attention module in the denoiser of the diffusion model.
Figure 3: Illustration of our proposed modules. (a) Given a text, the hyperbolic text-image encoder learns both sequential and multi-modal hierarchical features of the text, $C_{1}$. (b) The hyperbolic text-graph convolution module learns hierarchical syntactic features of the text, $C_{2}$. (c) Hyperbolic hierarchical loss supervises the hierarchical structure of 3D shape features in hyperbolic space.
Figure 4: Text-to-shape generation results. Above the dotted line are examples generated by our method; below is a comparison with SDFusion [sdfusion2023].
Figure 5: Visualizing text-shape hierarchical structure. Highlighted parts of the prompt represent the detailed information.
...and 7 more figures
