Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang

TL;DR

The paper tackles the high cost of collecting task-specific data for advanced AIGC by proposing Follow-Your-Instruction, an MLLM-driven agent that synthesizes coherent 2D, 3D, and 4D world data. It introduces four components—MLLM-Collector, MLLM-Generator, MLLM-Optimizer, and MLLM-Planner—to automate asset gathering, scene construction, multi-view refinement, and temporally coherent video generation. Through extensive experiments across eight MLLMs and three downstream tasks, the approach demonstrates that synthetic data can substantially boost downstream performance, with a notable emphasis on multi-view consistency and temporal coherence. The work provides a scalable data engine for generative intelligence and sets benchmarks for evaluating MLLM-driven data synthesis across 2D, 3D, and 4D domains.

Abstract

With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions from multimodal inputs using the MLLM-Collector. It then constructs 3D layouts with the MLLM-Generator and uses Vision-Language Models (VLMs) to refine them semantically across multi-view scenes with the MLLM-Optimizer. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
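The four stages named in the abstract can be read as a simple data flow: instruction to assets, assets to scene, scene to refined scene, refined scene to frames. The sketch below only illustrates that ordering. Every name in it (`Asset`, `Scene`, `run_pipeline`, and the `toy_*` stand-ins) is hypothetical and not taken from the paper, and the toy stages replace the MLLM and VLM calls with trivial string and coordinate logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Asset:
    name: str
    description: str


@dataclass
class Scene:
    assets: List[Asset]
    # object name -> (x, y, z) position; a stand-in for a real 3D layout
    layout: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)


def run_pipeline(
    instruction: str,
    collect: Callable[[str], List[Asset]],
    generate: Callable[[List[Asset]], Scene],
    optimize: Callable[[Scene], Scene],
    plan: Callable[[Scene, str], List[str]],
) -> List[str]:
    """Chain the four stages in the order the abstract describes."""
    assets = collect(instruction)      # role of the MLLM-Collector
    scene = generate(assets)           # role of the MLLM-Generator
    scene = optimize(scene)            # role of the MLLM-Optimizer
    return plan(scene, instruction)    # role of the MLLM-Planner


# Toy stand-ins (assumptions for illustration only; no model is called).
def toy_collect(instruction: str) -> List[Asset]:
    return [Asset(name=w, description=f"a {w}") for w in instruction.split()]


def toy_generate(assets: List[Asset]) -> Scene:
    # Place objects one unit apart along the x axis.
    layout = {a.name: (float(i), 0.0, 0.0) for i, a in enumerate(assets)}
    return Scene(assets=assets, layout=layout)


def toy_optimize(scene: Scene) -> Scene:
    # A real optimizer would adjust placements; this one returns a copy.
    return Scene(assets=scene.assets, layout=dict(scene.layout))


def toy_plan(scene: Scene, instruction: str) -> List[str]:
    # Emit three placeholder "frames" that list the objects in the scene.
    return [f"frame {t}: {sorted(scene.layout)}" for t in range(3)]


frames = run_pipeline("table cup", toy_collect, toy_generate, toy_optimize, toy_plan)
```

Passing the stages in as callables keeps the ordering explicit while leaving each stage free to be replaced by a real model-backed implementation.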

Paper Structure

This paper contains 11 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of Follow-Your-Instruction. We introduce Follow-Your-Instruction, an advanced MLLM-driven agent framework that synthesizes high-quality world data across 2D, 3D, and 4D levels, benefiting various downstream applications.
  • Figure 2: The Pipeline of Follow-Your-Instruction. Given multimodal inputs, Follow-Your-Instruction first collects the assets and their descriptions via MLLM-Collector. Then the MLLM-Generator creates the 3D layout scene and optimizes the scene via multi-view MLLM-Optimizer with a powerful VLM. Based on the scene, MLLM-Planner formulates a clear plan to generate the high-quality output video.
  • Figure 3: Motivation for Multi-view Optimization. (a): Optimizing the constructed scene from a single view may yield satisfactory results in that specific view, but object placements often exhibit semantic misalignments when observed from other perspectives. (b): Our proposed multi-view optimization effectively mitigates such inconsistencies by improving semantic correctness across multiple viewpoints, leading to globally coherent scene layouts.
  • Figure 4: Diverse downstream applications supported by Follow-Your-Instruction. Each task is accompanied by tailored annotations, such as background masks for object removal, camera trajectories for relighting, 3D and 4D reconstruction, as well as depth maps and object poses for 3D embodied intelligence.
  • Figure 5: Qualitative results for 2D object removal, 3D reconstruction, and 4D generation applications. The results show that our generated data effectively improves the performance of existing models.
  • ...and 2 more figures
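The contrast drawn in the Figure 3 caption, where a layout that looks right from one view can be wrong from others, can be illustrated with a toy scoring rule. The per-view scores below are made-up numbers, and taking the minimum over views is an assumption chosen for illustration, not the objective used in the paper; all function and variable names are hypothetical.

```python
from typing import Callable, Dict

ViewScores = Dict[str, float]


def front_only_score(scores: ViewScores) -> float:
    """Judge a layout from a single view (here: the front camera)."""
    return scores["front"]


def multi_view_score(scores: ViewScores) -> float:
    """Judge a layout by its worst view, so one bad angle penalizes it."""
    return min(scores.values())


def pick_best(candidates: Dict[str, ViewScores],
              scorer: Callable[[ViewScores], float]) -> str:
    """Return the name of the candidate layout with the highest score."""
    return max(candidates, key=lambda name: scorer(candidates[name]))


# Hypothetical semantic-correctness scores per camera view, in [0, 1].
candidates = {
    "A": {"front": 0.9, "side": 0.3, "top": 0.4},   # good only from the front
    "B": {"front": 0.8, "side": 0.7, "top": 0.75},  # consistent across views
}

single_choice = pick_best(candidates, front_only_score)
multi_choice = pick_best(candidates, multi_view_score)
```

Under these assumed scores, the single-view rule prefers layout A while the worst-view rule prefers layout B, which mirrors the caption's point that multi-view evaluation favors globally coherent layouts.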