Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models

Dingning Liu, Xiaoshui Huang, Yuenan Hou, Zhihui Wang, Zhenfei Yin, Yongshun Gong, Peng Gao, Wanli Ouyang

TL;DR

Uni3D-LLM introduces a unified, LLM-driven framework that jointly handles point-cloud perception, generation, and editing by aligning multimodal inputs (point clouds and images) into a common textual space and conditioning a diffusion-based generator through a dedicated LLM-to-generation mapping block. A two-stage training regimen (Stage I for generation conditioning with latent-diffusion losses, Stage II for perception via PEFT/LoRA) enables robust cross-task performance while preserving existing LLM knowledge. The architecture employs a multi-modality alignment backbone, a 259-token generation conditioning scheme, and a generation-to-editing pipeline that refines objects through rendered views and InstructPix2Pix-style updates. Experiments across perception, generation, and editing demonstrate improved cross-modal understanding and generation quality when image cues are integrated, with Cap3descript data further enhancing natural-language-driven 3D content creation. The work shows practical potential for interactive 3D design and editing guided by natural language, while noting current limitations in large-scale scene generation and flexible editing that future work should address.

Abstract

In this paper, we introduce Uni3D-LLM, a unified framework that leverages a Large Language Model (LLM) to integrate the tasks of 3D perception, generation, and editing within point cloud scenes. This framework empowers users to effortlessly generate and modify objects at specified locations within a scene, guided by the versatility of natural language descriptions. Uni3D-LLM harnesses the expressive power of natural language to allow for precise command over the generation and editing of 3D objects, thereby significantly enhancing operational flexibility and controllability. By mapping point clouds into a unified representation space, Uni3D-LLM achieves cross-application functionality, enabling the seamless execution of a wide array of tasks, ranging from the accurate instantiation of 3D objects to the diverse requirements of interactive design. Through a comprehensive suite of rigorous experiments, the efficacy of Uni3D-LLM in the comprehension, generation, and editing of point clouds has been validated. Additionally, we have assessed the impact of integrating a point cloud perception module on the generation and editing processes, confirming the substantial potential of our approach for practical applications.

Paper Structure (15 sections, 2 equations, 5 figures, 4 tables)

This paper contains 15 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 2: The framework overview. We first decompose the point cloud into sub-point clouds using 3D detection algorithms and obtain image features from its rendered top-down view. The two sets of features are added together to complete various 3D perception tasks. Our instructions are passed to the generation and editing module through the mapping block.
  • Figure 3: Multimodal alignment method. The image features are concatenated along the channel dimension after three encoder passes; in addition, the image is passed through an extra Q-Former to produce a global feature that is concatenated with the other features. The ROI is then re-encoded by Point-BERT, with additional padding after cross-attention in the query.
  • Figure 4: The pipeline of the editing method. After we obtain a 3D model, we also store the initialized 3D Gaussian. When the user gives a modification instruction, the generation-to-editing module renders images from multiple fixed perspectives, sends them to InstructPix2Pix, and then gradually updates the 3D Gaussian by updating the specified pose.
  • Figure 5: The method of Cap3descript. We input rendered images of an object from 8 perspectives into an MLLM to obtain different captions, and then use GPT to integrate them.
  • Figure 6: A complete multi-faceted description of an example. The captions for each angle are roughly the same, but details may differ for specific angles.
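
The Cap3descript flow described in the Figure 5 caption (render 8 views, caption each view with an MLLM, merge the captions with GPT) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names (`camera_azimuths`, `describe_object`), the evenly spaced azimuth angles, and the three callables standing in for the renderer, the MLLM captioner, and the GPT merger are all assumptions introduced here.

```python
from typing import Callable, List, Sequence


def camera_azimuths(num_views: int = 8) -> List[float]:
    """Evenly spaced azimuth angles in degrees (an assumed camera layout)."""
    return [i * 360.0 / num_views for i in range(num_views)]


def describe_object(
    render_view: Callable[[float], object],
    caption_view: Callable[[object], str],
    merge_captions: Callable[[Sequence[str]], str],
    num_views: int = 8,
) -> str:
    """Render the object from several views, caption each, then merge.

    All three callables are hypothetical placeholders:
      render_view     - returns an image for a given azimuth
      caption_view    - stands in for the MLLM captioning call
      merge_captions  - stands in for the GPT integration step
    """
    captions = [caption_view(render_view(az)) for az in camera_azimuths(num_views)]
    return merge_captions(captions)


# Toy stand-ins so the sketch runs without a renderer or any model.
def fake_render(azimuth: float) -> str:
    return f"view@{azimuth:.0f}"


def fake_caption(image: object) -> str:
    # Front and back views mention a detail the side views cannot see.
    detail = " with a carved backrest" if image in ("view@0", "view@180") else ""
    return f"a wooden chair{detail}"


def fake_merge(captions: Sequence[str]) -> str:
    # Keep each distinct caption once, preserving first-seen order.
    return "; ".join(dict.fromkeys(captions))


if __name__ == "__main__":
    print(describe_object(fake_render, fake_caption, fake_merge))
```

The toy captioner mirrors the observation in the Figure 6 caption: most per-view captions agree, while a few views contribute extra details, so the merge step mainly has to deduplicate and keep view-specific information.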