VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
Jianmeng Liu, Yichen Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang
TL;DR
VP-LLM tackles text-guided 3D completion by combining large language models with patch-wise 3D representations. The method splits voxel grids into patches, encodes each patch with a patch-wise VAE, and uses input and output projection layers to fuse the text prompt with the patch latents inside an LLM, so that a complete 3D model is generated in a single forward pass. The paper reports that VP-LLM outperforms diffusion-based baselines on ShapeNet in Chamfer distance (CD) and CLIP-s, and that it is robust to noisy inputs, while patchification keeps the approach scalable. This work introduces a scalable, controllable approach to text-conditioned 3D completion, with potential impact on 3D content creation and robotics.
Abstract
Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instructions. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by recent advances in LLMs, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single forward pass. To integrate a 3D model into the LLM's tokenization scheme, the incomplete 3D object is first divided into small patches that can be encoded independently. These encoded patches are then fed into an LLM along with the text prompt, instructing the LLM to capture the relations between the patches and to inject semantic meaning into the 3D object. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.
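The patch division described above can be illustrated with a minimal NumPy sketch. It only shows how a cubic voxel grid may be split into non-overlapping patches and reassembled; it is not the authors' implementation, and the function names `patchify` and `unpatchify`, the 64-voxel grid side, and the 8-voxel patch side are all assumptions chosen for illustration. The patch-wise VAE and the projection layers are not modeled here.

```python
import numpy as np


def patchify(volume: np.ndarray, patch: int) -> np.ndarray:
    """Split a cubic (D, D, D) voxel grid into non-overlapping (p, p, p) patches.

    Hypothetical helper: returns an array of shape (n**3, p, p, p), n = D // p,
    with patches ordered by their (x, y, z) block index.
    """
    d = volume.shape[0]
    if volume.shape != (d, d, d) or d % patch != 0:
        raise ValueError("expected a cubic grid whose side is divisible by patch")
    n = d // patch
    # Split every axis into (block index, offset inside block).
    blocks = volume.reshape(n, patch, n, patch, n, patch)
    # Bring the three block indices to the front, then flatten them.
    return blocks.transpose(0, 2, 4, 1, 3, 5).reshape(n ** 3, patch, patch, patch)


def unpatchify(patches: np.ndarray, side: int) -> np.ndarray:
    """Inverse of patchify: reassemble the patches into a (side, side, side) grid."""
    p = patches.shape[1]
    n = side // p
    blocks = patches.reshape(n, n, n, p, p, p)
    # Interleave block indices with in-block offsets again.
    return blocks.transpose(0, 3, 1, 4, 2, 5).reshape(side, side, side)


# Assumed sizes for illustration only: a 64^3 occupancy grid and 8^3 patches.
grid = (np.random.default_rng(0).random((64, 64, 64)) > 0.5).astype(np.float32)
patches = patchify(grid, 8)          # one entry per patch, encoded independently
restored = unpatchify(patches, 64)   # lossless round trip
```

Because each patch is a separate array entry, every patch can be passed to an encoder on its own, which is what allows the sequence of patch latents to be treated like a token sequence.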
