V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang
TL;DR
V-Stylist tackles open-query, long-video stylization by orchestrating three MLLM-based agents: Video Parser, which segments videos into shots and generates per-shot prompts; Style Parser, which identifies the requested style and selects a matching style model via a tree-of-thought search over a style model tree; and Style Artist, which renders shots with the chosen model and refines detail control through a multi-round self-reflection loop. The method enables shot-level content preservation and adaptive, style-consistent rendering by integrating ControlNets and temporal enhancements (AnimateDiff). A new benchmark, TVSBench, assesses condition alignment, temporal consistency, and video quality on complex videos with diverse open queries. Experiments show V-Stylist achieving state-of-the-art performance, e.g., improvements of 6.05% and 4.51% over FRESCO and ControlVideo, respectively, in overall average metrics. The work demonstrates the practical value of collaborative MLLM agents and reflection-driven optimization for robust, open-ended video editing, with potential to improve long-form video stylization tools and workflows.
Abstract
Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions from an open-ended style description in a user query. To fill this gap, we introduce V-Stylist, a generic multi-agent system for video stylization built on a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates text prompts describing the key content of each shot. Via a concise video-to-shot prompting paradigm, it allows V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches for the matched style model in a style tree. Via a robust tree-of-thought searching paradigm, it allows V-Stylist to precisely specify vague style preferences in the open user query. (3) Style Artist leverages the matched model to render all the video shots in the required style. Via a novel multi-round self-reflection paradigm, it allows V-Stylist to adaptively adjust detail control according to the style requirement. With this distinct design that mimics human professionals, V-Stylist achieves a major breakthrough on the primary challenges of effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), which fills the gap in assessing the stylization of complex videos on open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance, e.g., it surpasses FRESCO and ControlVideo by 6.05% and 4.51%, respectively, in overall average metrics, marking a significant advance in video stylization.
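To make the Style Parser and Style Artist roles more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `StyleNode` tree whose leaves name style models, a caller-supplied `score` function standing in for an MLLM judgment, and caller-supplied `render` and `critique` functions standing in for the diffusion renderer and the reflection step. The search shown is a simplified greedy descent rather than a full tree-of-thought procedure, and the single `control` value is an assumed stand-in for whatever detail-control settings the real system adjusts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class StyleNode:
    """Node in a hypothetical style tree; only leaves carry a model name."""
    name: str
    children: List["StyleNode"] = field(default_factory=list)
    model: Optional[str] = None


def search_style_tree(root: StyleNode, query: str,
                      score: Callable[[str, str], float]) -> StyleNode:
    """Descend from the root, keeping the best-scoring child at each level.

    `score(query, node_name)` is an assumed placeholder for an MLLM rating
    of how well a style category matches the user query.
    """
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.name))
    return node


def render_with_reflection(shot_prompt: str, model: str,
                           render: Callable[[str, str, float], str],
                           critique: Callable[[str], float],
                           max_rounds: int = 3,
                           control: float = 1.0,
                           step: float = 0.2) -> Tuple[str, float]:
    """Render one shot, then adjust the control strength over several rounds.

    `critique(result)` is an assumed placeholder returning a signed value:
    positive asks for more detail control, negative for less, zero accepts.
    """
    result = render(shot_prompt, model, control)
    for _ in range(max_rounds):
        feedback = critique(result)
        if feedback == 0:
            break  # the reviewer accepts the current rendering
        direction = 1.0 if feedback > 0 else -1.0
        # Clamp the control strength to [0, 1] before re-rendering.
        control = min(1.0, max(0.0, control + direction * step))
        result = render(shot_prompt, model, control)
    return result, control
```

In a full pipeline under these assumptions, the per-shot prompts from the Video Parser would be passed one at a time to `render_with_reflection`, using the model name found on the leaf returned by `search_style_tree`.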
