Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
Kunyu Feng, Yue Ma, Xinhua Zhang, Boshi Liu, Yikuang Yuluo, Yinhan Zhang, Runtao Liu, Hongyu Liu, Zhiyuan Qin, Shanhui Mo, Qifeng Chen, Zeyu Wang
TL;DR
The paper tackles the high cost of collecting task-specific data for advanced AIGC by proposing Follow-Your-Instruction, an MLLM-driven agent that synthesizes coherent 2D, 3D, and 4D world data. It introduces four components (MLLM-Collector, MLLM-Generator, MLLM-Optimizer, and MLLM-Planner) to automate asset gathering, scene construction, multi-view refinement, and temporally coherent video generation. Through extensive experiments across eight MLLMs and three downstream tasks, the approach demonstrates that synthetic data can substantially boost downstream performance, with a notable emphasis on multi-view consistency and temporal coherence. The work provides a scalable data engine for generative intelligence and sets benchmarks for evaluating MLLM-driven data synthesis across 2D, 3D, and 4D domains.
Abstract
With the growing demand for AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions from multimodal inputs using the MLLM-Collector. It then constructs 3D layouts with the MLLM-Generator and, with the MLLM-Optimizer, leverages Vision-Language Models (VLMs) to semantically refine them through multi-view scenes. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
