Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents

Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, Yanfeng Wang

TL;DR

ChatSim tackles the need for editable, photo-realistic 3D driving scene data generation guided by natural language, enabling users to specify complex scenarios and incorporate external assets. It combines a collaborative LLM-Agents framework with two novel rendering components: McNeRF, for multi-camera background rendering with HDR radiance and exposure handling, and McLight, for asset lighting with skydome and environment illumination. On the Waymo Open Dataset, ChatSim demonstrates robust handling of abstract and multi-round commands and achieves state-of-the-art photo-realism and lighting accuracy, enabling richer, controllable data generation for perception tasks. The approach has practical impact for data augmentation, rare-case simulation, and testing perception systems under diverse conditions.

Abstract

Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However, existing editable scene simulation approaches face limitations in terms of user interaction efficiency, multi-camera photo-realistic rendering, and integration of external digital assets. To address these challenges, this paper introduces ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility, ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes, ChatSim employs a novel multi-camera neural radiance field method. Furthermore, to unleash the potential of extensive high-quality digital assets, ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent asset rendering. Our experiments on the Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos.

Paper Structure

This paper contains 37 sections, 6 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: ChatSim enables the editing of photo-realistic 3D driving scene simulations via language commands.
  • Figure 2: ChatSim system overview. The system exploits multiple collaborative LLM agents with specialized roles to decouple an overall demand into specific editing tasks. Each agent is equipped with an LLM and corresponding role functions to interpret and execute its specific tasks.
  • Figure 3: Prompt example of view adjustment agent.
  • Figure 4: Rendering framework. The main components are McNeRF and McLight. Background rendering uses McNeRF to predict HDR pixel values and convert them to LDR with the sRGB OETF. McLight includes a skydome lighting estimation network and adopts McNeRF to generate surrounding lighting.
  • Figure 5: Editing result under a complex and mixed command.
  • ...and 12 more figures
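
The Figure 4 caption mentions converting predicted HDR pixel values to LDR with the sRGB OETF. As a rough illustration of that step only, the sketch below applies the standard sRGB transfer function to linear values. It is not taken from the ChatSim code: the function names, the `exposure` parameter, and the clipping to [0, 1] before encoding are assumptions made for this example.

```python
def srgb_oetf(c: float) -> float:
    """Standard sRGB OETF for one linear channel value.

    Clipping to [0, 1] first is an assumption of this sketch.
    """
    c = min(max(c, 0.0), 1.0)
    if c <= 0.0031308:
        # Linear segment near black.
        return 12.92 * c
    # Power-law segment for the rest of the range.
    return 1.055 * c ** (1.0 / 2.4) - 0.055


def hdr_to_ldr(pixel, exposure=1.0):
    """Scale an HDR pixel by a hypothetical exposure factor, then encode to 8 bits."""
    return tuple(round(srgb_oetf(ch * exposure) * 255) for ch in pixel)


if __name__ == "__main__":
    # Values above 1.0 after exposure scaling saturate at 255.
    print(hdr_to_ldr((0.0, 0.18, 4.0), exposure=1.0))
```

In this sketch, lowering `exposure` brings bright HDR values back below the clipping point, which is one simple way to picture why exposure has to be handled alongside HDR radiance.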