VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification

Jianmeng Liu, Yichen Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang

TL;DR

VP-LLM addresses text-guided 3D completion by integrating large language models with patch-wise 3D representations. The method divides voxel grids into patches, encodes each patch with a patch-wise VAE, and uses input and output projection layers to fuse the text prompt with the patch latents inside an LLM, so that a complete 3D model is generated in a single forward pass. The paper reports that VP-LLM outperforms diffusion-based baselines on ShapeNet in Chamfer Distance (CD) and CLIP-s metrics and is robust to noisy inputs, while patchification keeps the approach scalable. This work introduces a scalable, controllable approach to text-conditioned 3D completion, with potential impact on 3D content creation and robotics.

Abstract

Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instructions. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by recent advances in LLMs, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass. To integrate a 3D model into the LLM tokenization configuration, the incomplete 3D object is first divided into small patches that can be encoded independently. These encoded patches are then fed into an LLM along with the text prompt, instructing the LLM to capture the relations between these patches and to inject semantic meaning into the 3D object. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
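
As a rough illustration of the patch division described in the abstract, the following NumPy sketch splits a cubic voxel grid into non-overlapping cubic patches and reassembles them. The function names `patchify` and `unpatchify`, the grid resolution, and the patch size are assumptions made for this example; the paper's actual values and implementation may differ.

```python
import numpy as np

def patchify(voxels: np.ndarray, p: int) -> np.ndarray:
    """Split a cubic (R, R, R) voxel grid into non-overlapping (p, p, p) patches.

    Returns an array of shape (n**3, p, p, p) with n = R // p.
    Patch order is row-major over the (x, y, z) patch indices.
    """
    r = voxels.shape[0]
    if voxels.shape != (r, r, r) or r % p != 0:
        raise ValueError("expected a cubic grid whose side is divisible by p")
    n = r // p
    # (n, p, n, p, n, p) -> (n, n, n, p, p, p) -> (n**3, p, p, p)
    blocks = voxels.reshape(n, p, n, p, n, p).transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(n ** 3, p, p, p)

def unpatchify(patches: np.ndarray, r: int) -> np.ndarray:
    """Inverse of patchify: reassemble (n**3, p, p, p) patches into (r, r, r)."""
    p = patches.shape[-1]
    n = r // p
    blocks = patches.reshape(n, n, n, p, p, p).transpose(0, 3, 1, 4, 2, 5)
    return blocks.reshape(r, r, r)

# Hypothetical sizes: a 64^3 grid cut into 8^3 patches gives 512 patches.
grid = np.arange(64 ** 3).reshape(64, 64, 64)
patches = patchify(grid, 8)
restored = unpatchify(patches, 64)
```

Because each patch is a fixed-size block, a single encoder can process every patch the same way, which is what lets the sequence of patches be treated like a sequence of tokens.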

Paper Structure

This paper contains 28 sections, 2 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Overview. VP-LLM leverages the long-context comprehension capability of Large Language Models (LLMs) to process 3D models. It takes either incomplete or noisy 3D models along with textual instructions as input, and generates a complete model. This is achieved by segmenting the 3D object into patches and processing each independently.
  • Figure 2: Patchification: given a 3D object, we first fit it into a voxel grid and then divide it into a sequence of small patches. Next, we utilize a patch-wise Variational Autoencoder (VAE) to extract the features of each patch individually and then reconstruct the patch. It is important to note that only one VAE is trained for all the patches throughout the entire dataset, making our method a scalable approach.
  • Figure 3: The training process of the input projection (left) and output projection (right). During the input projection training, a single weight-shared MLP maps the masked or noisy 3D tokens encoded by our patch-wise VAE to the embedding space of the LLM. After wrapping the prompt with the 3D tokens as input and feeding them to the LLM, we back-propagate the loss calculated between the ground-truth caption and the LLM's prediction, enabling the LLM to learn to generate captions that accurately describe the 3D object from the input patches. For the output projection, we freeze the input projection layer and train the output projection layer, while also fine-tuning the LLM with LoRA. The output projection layer comprises a Transformer and a cluster of MLPs, such that after passing through the Transformer, every 3D token is processed independently with an MLP.
  • Figure 4: 3D data augmentation example result of an airplane.
  • Figure 5: From left to right: Object 1, 2, 3, 4.
  • ...and 11 more figures
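
The input projection described in the Figure 3 caption, where one weight-shared MLP maps every patch latent into the LLM embedding space, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two-layer depth, the ReLU activation, all dimensions, the placement of the prompt before the 3D tokens, and the helper names `init_mlp`, `project_tokens`, and `wrap_with_prompt` are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in: int, d_hidden: int, d_out: int) -> dict:
    """Create parameters for one two-layer MLP (depth is an assumption)."""
    return {
        "w1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }

def project_tokens(latents: np.ndarray, params: dict) -> np.ndarray:
    """Apply the same MLP to every patch latent: (N, d_in) -> (N, d_out)."""
    hidden = np.maximum(latents @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]

def wrap_with_prompt(prompt_emb: np.ndarray, patch_emb: np.ndarray) -> np.ndarray:
    """Concatenate prompt embeddings and 3D tokens along the sequence axis.

    Putting the prompt first is an assumption made for this sketch.
    """
    return np.concatenate([prompt_emb, patch_emb], axis=0)

# Hypothetical sizes: 512 patch latents of width 16, LLM embedding width 64.
params = init_mlp(d_in=16, d_hidden=32, d_out=64)
latents = rng.normal(size=(512, 16))
tokens_3d = project_tokens(latents, params)
sequence = wrap_with_prompt(np.zeros((5, 64)), tokens_3d)
```

Since the same parameters are applied to every row, projecting one patch on its own gives the same result as projecting it as part of the full batch, so the parameter count does not grow with the number of patches.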