Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models
Dingning Liu, Xiaoshui Huang, Yuenan Hou, Zhihui Wang, Zhenfei Yin, Yongshun Gong, Peng Gao, Wanli Ouyang
TL;DR
Uni3D-LLM introduces a unified, LLM-driven framework that jointly handles point-cloud perception, generation, and editing by aligning multimodal inputs (point clouds and images) into a common textual space and conditioning a diffusion-based generator through a dedicated LLM-to-generation mapping block. A two-stage training regimen (Stage I for generation conditioning with latent-diffusion losses, Stage II for perception via PEFT/LoRA) enables robust cross-task performance while preserving existing LLM knowledge. The architecture employs a multi-modality alignment backbone, a 259-token generation conditioning scheme, and a generation-to-editing pipeline that refines objects through rendered views and InstructPix2Pix-style updates. Experiments across perception, generation, and editing demonstrate improved cross-modal understanding and generation quality when image cues are integrated, with Cap3descript data further enhancing natural-language-driven 3D content creation. The work shows practical potential for interactive 3D design and editing guided by natural language, while noting current limitations in large-scale scene generation and flexible editing that future work should address.
Abstract
In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate the tasks of 3D perception, generation, and editing within point cloud scenes. This framework lets users generate and modify objects at specified locations within a scene, guided by natural language descriptions. Uni3D-LLM harnesses the expressive power of natural language to allow precise control over the generation and editing of 3D objects, thereby significantly enhancing operational flexibility and controllability. By mapping point clouds into a unified representation space, Uni3D-LLM achieves cross-application functionality, enabling the execution of a wide array of tasks, ranging from the accurate instantiation of 3D objects to the diverse requirements of interactive design. Through a comprehensive suite of experiments, we validate the efficacy of Uni3D-LLM in the comprehension, generation, and editing of point clouds. Additionally, we assess the impact of integrating a point cloud perception module on the generation and editing processes, confirming the substantial potential of our approach for practical applications.
