Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders

Jingyu Guo, Sensen Gao, Jia-Wang Bian, Wanhu Sun, Heliang Zheng, Rongfei Jia, Mingming Gong

TL;DR

Hyper3D targets a bottleneck in diffusion-based 3D generation: encoding 3D shapes with high fidelity in the VAE. It introduces an octree-based input feature extractor and a hybrid triplane latent that combines a high-resolution triplane with a low-resolution 3D grid, preserving explicit 3D structure while keeping the latent compact. In experiments on Objaverse, Hyper3D outperforms baselines in reconstruction quality and fine geometric detail, and ablations validate the contributions of both the octree inputs and the hybrid latent. The authors position the method as a step toward more efficient, high-detail 3D generation pipelines and name stronger generative models and multi-modal texture integration as future directions.

Abstract

Recent 3D content generation pipelines often leverage Variational Autoencoders (VAEs) to encode shapes into compact latent representations, facilitating diffusion-based generation. Efficiently compressing 3D shapes while preserving intricate geometric details remains a key challenge. Existing 3D shape VAEs often employ uniform point sampling and 1D/2D latent representations, such as vector sets or triplanes, leading to significant geometric detail loss due to inadequate surface coverage and the absence of explicit 3D representations in the latent space. Although recent work explores 3D latent representations, their large scale hinders high-resolution encoding and efficient training. Given these challenges, we introduce Hyper3D, which enhances VAE reconstruction through efficient 3D representation that integrates hybrid triplane and octree features. First, we adopt an octree-based feature representation to embed mesh information into the network, mitigating the limitations of uniform point sampling in capturing geometric distributions along the mesh surface. Furthermore, we propose a hybrid latent space representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design not only compensates for the lack of explicit 3D representations but also leverages a triplane to preserve high-resolution details. Experimental results demonstrate that Hyper3D outperforms traditional representations by reconstructing 3D shapes with higher fidelity and finer details, making it well-suited for 3D generation pipelines.
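
The compactness argument can be checked with simple arithmetic. The sketch below assumes that each triplane cell and each grid cell counts as one latent token, so a hybrid latent with triplane resolution R and grid resolution G has 3·R² + G³ tokens. This formula is an inference rather than something the text states, but it reproduces the token lengths quoted in the figure captions (3584, 3888, 16,384, and 16,428, the last one assuming a triplane resolution of 74). The function names are hypothetical, and the dense-grid figures are included only for scale.

```python
def triplane_tokens(plane_res: int) -> int:
    """Tokens in a pure triplane latent: three planes of plane_res x plane_res cells."""
    return 3 * plane_res ** 2


def hybrid_tokens(plane_res: int, grid_res: int) -> int:
    """Tokens in a hybrid latent: a triplane plus a grid_res^3 3D grid (assumed layout)."""
    return triplane_tokens(plane_res) + grid_res ** 3


def dense_grid_tokens(res: int) -> int:
    """Tokens in a pure 3D grid latent at the same resolution as the triplane."""
    return res ** 3


print(hybrid_tokens(32, 8))    # 3584   "Ours (32/8)"
print(triplane_tokens(36))     # 3888   "Direct3D (36)"
print(hybrid_tokens(64, 16))   # 16384  "Ours (64/16)"
print(triplane_tokens(74))     # 16428  triplane baseline, resolution inferred
print(dense_grid_tokens(32))   # 32768  dense grid at resolution 32
print(dense_grid_tokens(64))   # 262144 dense grid at resolution 64
```

Under this counting, the low-resolution grid adds explicit 3D structure for a few hundred to a few thousand extra tokens, whereas a dense grid at the triplane's resolution would be roughly 9 to 16 times larger.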

Paper Structure

This paper contains 23 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Latent space representation for 3D shape VAE: Triplane vs. Hybrid Triplane. The left depicts the triplane representation composed of three 2D planes. On the right, we propose a hybrid triplane representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design incorporates explicit 3D representation in the latent space while keeping the latent size manageable, ensuring high resolution without excessive complexity or training difficulty.
  • Figure 2: Qualitative comparison of VAE reconstruction results. "Ours (32/8)" denotes our Hyper3D-VAE, where the resolutions of the latent triplane and latent grid are set to 32 and 8, respectively. "Direct3D (36)" refers to a variant of Direct3D in which we set the latent triplane resolution to 36, ensuring a fair comparison between Ours (32/8) and Direct3D (36) at similar latent token lengths (3,584 vs. 3,888). Similarly, "Ours (64/16)" represents our model with a latent triplane resolution of 64 and a latent grid resolution of 16, allowing for a comparison with Trellis at similar latent token lengths (16,384 vs. 20,000). (Best viewed zoomed in.)
  • Figure 3: Overview of the proposed Hyper3D-VAE. Instead of relying solely on uniform sampling on mesh surfaces, we utilize an octree-based 3D feature extractor to capture high-frequency geometric details more effectively. During encoding, we introduce learnable triplane tokens and learnable grid tokens, which are concatenated to form learnable hybrid tokens. This design enables the model to effectively capture spatial dependencies across both 2D and 3D representations. In our decoder, the latent hybrid tokens are separated and reshaped into their respective 2D and 3D structures, followed by several upconvolutional layers.
  • Figure 4: Qualitative comparison for the ablation of VAE with different input strategies.
  • Figure 5: Qualitative comparison for the ablation of VAE with different representations. The second and third columns present a comparison between triplane and hybrid triplane under latent token lengths of 3,888 and 3,584, respectively. The fourth and fifth columns present a comparison under latent token lengths of 16,428 and 16,384, respectively.
  • ...and 4 more figures
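
The decoder step described in the Figure 3 caption, where the latent hybrid tokens are separated and reshaped into their 2D and 3D structures, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the channel width of 16, the channels-last layout, and the ordering of triplane tokens before grid tokens are all assumptions made for the example.

```python
import numpy as np


def split_hybrid_tokens(tokens: np.ndarray, plane_res: int, grid_res: int):
    """Split (N, C) hybrid tokens into a triplane and a 3D grid.

    Assumes the first 3 * plane_res^2 tokens belong to the triplane and the
    remaining grid_res^3 tokens belong to the grid (hypothetical ordering).
    """
    n_plane = 3 * plane_res ** 2
    n_grid = grid_res ** 3
    if tokens.ndim != 2 or tokens.shape[0] != n_plane + n_grid:
        raise ValueError(
            f"expected ({n_plane + n_grid}, C) tokens, got {tokens.shape}"
        )
    channels = tokens.shape[1]
    # Three 2D feature planes, channels last.
    triplane = tokens[:n_plane].reshape(3, plane_res, plane_res, channels)
    # One low-resolution 3D feature grid, channels last.
    grid = tokens[n_plane:].reshape(grid_res, grid_res, grid_res, channels)
    return triplane, grid


# Example with the "Ours (32/8)" resolutions and a hypothetical channel width.
tokens = np.zeros((3 * 32 ** 2 + 8 ** 3, 16), dtype=np.float32)
triplane, grid = split_hybrid_tokens(tokens, plane_res=32, grid_res=8)
print(triplane.shape)  # (3, 32, 32, 16)
print(grid.shape)      # (8, 8, 8, 16)
```

In the architecture the caption describes, each of the two reshaped tensors would then pass through its own up-convolutional layers. Those layers are omitted here because their configuration is not given in this section.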