Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds

Hongliang Zeng, Ping Zhang, Fang Li, Jiahua Wang, Tingyu Ye, Pengteng Guo

TL;DR

Point-MGE presents a unified framework that integrates 3D representation learning and generative modeling for point clouds by tokenizing local patches with a VQVAE into discrete NeRF-based semantic tokens and training with a sliding masking ratio. A ViT-based extractor-generator processes visible tokens to reconstruct masked tokens and center coordinates, enabling both high-quality representations and realistic 3D generation, including unconditional and conditional synthesis. The approach addresses sampling bias in point clouds, scales to high-capacity models, and achieves strong results on downstream tasks (e.g., ModelNet40, ScanObjectNN, ShapeNetPart) while delivering competitive 3D generation metrics. Overall, Point-MGE demonstrates the practical impact of synergizing representation and generation in 3D vision, offering a pathway for robust, high-fidelity 3D shape understanding and synthesis.
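
As a rough illustration of the masked extractor-generator step, the sketch below shows one way the pipeline could be wired up in PyTorch. All names, depths, and dimensions (MaskedExtractorGenerator, dim=384, codebook size 8192) are illustrative assumptions, not the authors' implementation: the extractor encodes only the visible tokens, and a lighter generator fills in mask tokens and predicts the discrete token id and center coordinates of each masked patch.

```python
import torch
import torch.nn as nn

class MaskedExtractorGenerator(nn.Module):
    """Sketch: the extractor sees only visible patch tokens; the generator
    restores masked positions and predicts (a) the discrete VQ token id and
    (b) the patch center xyz for every masked position."""

    def __init__(self, dim=384, vocab=8192, depth=12, gen_depth=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.extractor = nn.TransformerEncoder(layer(), num_layers=depth)
        self.generator = nn.TransformerEncoder(layer(), num_layers=gen_depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.token_head = nn.Linear(dim, vocab)   # logits over the VQVAE codebook
        self.center_head = nn.Linear(dim, 3)      # xyz of each masked patch center

    def forward(self, tok_emb, mask):
        # tok_emb: (B, N, dim) embedded patch tokens; mask: (B, N) bool, True = masked.
        # Assumes every sample in the batch masks the same number of patches.
        B, N, D = tok_emb.shape
        visible = tok_emb[~mask].reshape(B, -1, D)
        feats = self.extractor(visible)               # representations of visible patches
        full = self.mask_token.expand(B, N, D).clone()
        full[~mask] = feats.reshape(-1, D)
        full = self.generator(full)                   # fill in the masked positions
        out = full[mask]                              # (num_masked, dim)
        return self.token_head(out), self.center_head(out)

model = MaskedExtractorGenerator()
x = torch.randn(2, 64, 384)                  # 64 patch embeddings per shape
mask = torch.zeros(2, 64, dtype=torch.bool)
mask[:, 19:] = True                          # ~70% of patches masked
logits, centers = model(x, mask)             # train with CE on ids, MSE on centers
```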

Abstract

Representation and generative learning, as reconstruction-based methods, have demonstrated their potential for mutual reinforcement across various domains. In the field of point cloud processing, although existing studies have adopted training strategies from generative models to enhance representational capabilities, these methods are limited by their inability to genuinely generate 3D shapes. To explore the benefits of deeply integrating 3D representation learning and generative learning, we propose an innovative framework called Point-MGE. Specifically, this framework first utilizes a vector quantized variational autoencoder to reconstruct a neural field representation of 3D shapes, thereby learning discrete semantic features of point patches. Subsequently, we design a sliding masking ratio strategy to smooth the transition from representation learning to generative learning. Moreover, our method demonstrates strong generalization capability in learning high-capacity models, achieving new state-of-the-art performance across multiple downstream tasks. In shape classification, Point-MGE achieved an accuracy of 94.2% (+1.0%) on the ModelNet40 dataset and 92.9% (+5.5%) on the ScanObjectNN dataset. Experimental results also confirmed that Point-MGE can generate high-quality 3D shapes in both unconditional and conditional settings.
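
The "sliding masking ratio" can be made concrete with a small schedule like the one below: the sampling window for the per-iteration masking ratio drifts from the moderate ratios typical of masked representation learning toward the near-complete masking needed for iterative generation. The endpoints and window width are hypothetical placeholders, not values reported in the paper.

```python
import random

def sliding_mask_ratio(step, total_steps, lo=0.4, hi=0.9, window=0.1):
    """Sample a masking ratio from a window whose center slides from `lo`
    (representation-style masking) to `hi` (generation-style masking)
    as pre-training proceeds."""
    center = lo + (hi - lo) * step / total_steps
    ratio = random.uniform(center - window / 2, center + window / 2)
    return min(max(ratio, 0.0), 1.0)

# e.g. mask int(sliding_mask_ratio(step, total_steps) * num_patches) patches
```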

Paper Structure

This paper contains 31 sections, 12 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overall pipeline of Point-MGE. First, a VQVAE is used to reconstruct 3D shapes represented by NeRFs, converting the input point cloud into a series of discrete semantic tokens (see the vector-quantization sketch after this figure list). Then, a ViT-based extractor-generator architecture extracts high-quality feature representations from the unmasked tokens and reconstructs the masked tokens.
  • Figure 2: Visualization of unconditional generation. All models were trained on the complete ShapeNet dataset, and we sampled nine generated results for rendering.
  • Figure 3: Visualization of category-conditional generation. We selected three category labels (airplane, faucet, and table) to display the rendered results of the generated shapes.
  • Figure 4: Visualization of masked point cloud conditional generation. We used a block masking strategy to mask 70% of the complete point cloud as conditional input. The generated results demonstrate the ability of different models to reconstruct the complete shapes.
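
As referenced in the Figure 1 caption, the VQVAE's vector-quantization step maps each continuous patch feature to its nearest codebook entry, producing the discrete semantic tokens the extractor-generator operates on. The snippet below is a generic straight-through VQ layer with an assumed codebook size and dimension, not the paper's exact NeRF-supervised VQVAE.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Map continuous patch features to their nearest codebook entries,
    yielding discrete token ids; a straight-through estimator lets
    gradients flow back to the encoder."""

    def __init__(self, vocab=8192, dim=384):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)

    def forward(self, z):                                     # z: (B, N, dim)
        B, N, D = z.shape
        dists = torch.cdist(z.reshape(-1, D), self.codebook.weight)
        ids = dists.argmin(dim=-1).reshape(B, N)              # discrete token ids
        quantized = self.codebook(ids)                        # nearest codebook entries
        quantized = z + (quantized - z).detach()              # straight-through gradient
        return quantized, ids
```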