ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

Abstract

This paper presents ShareVerse, a video generation framework for multi-agent shared world modeling. It addresses a gap in existing work, which lacks support for constructing a unified shared world with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) We build a dataset for large-scale multi-agent interactive world modeling on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories, with paired multi-view videos (front, rear, left, and right views per agent) and camera data. 2) We propose a spatial concatenation strategy for the four-view videos of each independent agent, which models a broader environment and ensures internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model; these enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse supports 49-frame large-scale video generation, accurately perceives the positions of dynamic agents, and achieves consistent shared world modeling.
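
To make the spatial concatenation idea in item 2 concrete, the following is a minimal NumPy sketch of how the four per-agent views could be merged into one video canvas. The abstract does not specify the layout, so the view order, the side-by-side "row" layout, and the 2x2 "grid" alternative are all assumptions, and the name `concat_views` is hypothetical rather than taken from the paper.

```python
import numpy as np

# Assumed view order; the abstract only names the four views.
VIEW_ORDER = ("front", "rear", "left", "right")


def concat_views(views, layout="row"):
    """Spatially concatenate four view clips of shape (T, H, W, C).

    layout="row":  views side by side        -> (T, H, 4W, C)
    layout="grid": views in a 2x2 arrangement -> (T, 2H, 2W, C)
    Both layouts are illustrative assumptions.
    """
    clips = [views[name] for name in VIEW_ORDER]
    if layout == "row":
        return np.concatenate(clips, axis=2)  # join along width
    if layout == "grid":
        top = np.concatenate(clips[:2], axis=2)     # front | rear
        bottom = np.concatenate(clips[2:], axis=2)  # left  | right
        return np.concatenate([top, bottom], axis=1)  # stack along height
    raise ValueError(f"unknown layout: {layout!r}")


# Toy example: 49 frames per view, tiny 4x6 RGB frames.
views = {name: np.full((49, 4, 6, 3), i, dtype=np.uint8)
         for i, name in enumerate(VIEW_ORDER)}
row_canvas = concat_views(views, layout="row")
grid_canvas = concat_views(views, layout="grid")
```

Because all four views share one canvas, a single video model pass sees them jointly, which is one plausible reading of how concatenation supports multi-view consistency within an agent.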

Paper Structure (12 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of the dataset construction process. We build the multi-agent synchronized training dataset by rendering in CARLA. We equip each agent with four cameras (front, rear, left, and right) and predefine six main trajectory pairs, as shown in (a)–(f). Each pair enables interaction between agents.
  • Figure 2: Method overview. Given one image from each of two agents, where four views are concatenated to depict a scene, ShareVerse performs a prediction task of 49 video frames to generate future videos conditioned on the camera trajectories from users. In the generation process, two agents explore the world, exchange captured visual information, and perceive each other’s positions.
  • Figure 3: Qualitative results. a) The four-view video of a certain agent maintains strong internal consistency. b) Agents generate visual scenes based on camera trajectories and interact with dynamically generated information, rather than being conditioned solely on the first frame. c) We compare the generated videos before and after the removal of certain buildings within a map. The sample shows that agents effectively exchange information based on the scene context, achieving a shared world.
  • Figure 4: Dynamic positions. We keep the trajectory of one agent fixed and modify the other to generate two video samples. Comparison with the ground truth demonstrates that our model can accurately perceive the positions of the other agent.
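
The information exchange between two agents described in Figure 2 can be illustrated with a generic cross-attention step. This is only a sketch under stated assumptions: the page does not describe the internal design of the cross-agent attention blocks, so the single-head form, the residual connection, and the names `cross_agent_attention`, `w_q`, `w_k`, and `w_v` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_agent_attention(tokens_a, tokens_b, w_q, w_k, w_v):
    """Let agent A's tokens attend to agent B's tokens.

    tokens_a: (N_a, D) spatio-temporal tokens of agent A
    tokens_b: (N_b, D) spatio-temporal tokens of agent B
    w_q, w_k, w_v: (D, D) projection matrices (assumed, single head)
    Returns agent A's tokens updated with information from agent B.
    """
    q = tokens_a @ w_q                       # queries from agent A
    k = tokens_b @ w_k                       # keys from agent B
    v = tokens_b @ w_v                       # values from agent B
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N_a, N_b) similarities
    return tokens_a + softmax(scores) @ v    # residual update (assumed)


rng = np.random.default_rng(0)
dim = 8
tokens_a = rng.normal(size=(5, dim))
tokens_b = rng.normal(size=(7, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

updated_a = cross_agent_attention(tokens_a, tokens_b, w_q, w_k, w_v)
updated_b = cross_agent_attention(tokens_b, tokens_a, w_q, w_k, w_v)
```

Applying the same step in both directions gives each agent access to what the other has observed, which is the behaviour needed for consistency in overlapping regions.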