V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

Zhengrong Yue, Shaobin Zhuang, Kunchang Li, Yanbo Ding, Yali Wang

TL;DR

V-Stylist tackles open-query, long-video stylization by orchestrating three MLLM-based agents: Video Parser, which segments videos into shots and generates per-shot prompts; Style Parser, which identifies and selects an appropriate style model via a tree-of-thought search over a style model tree; and Style Artist, which renders shots with the chosen model and refines detail control through a multi-round self-reflection loop. The method enables shot-level content preservation and adaptive, style-consistent rendering by integrating ControlNets and temporal enhancements (AnimateDiff). A new benchmark, TVSBench, assesses condition alignment, temporal consistency, and video quality on complex videos with diverse open queries, and experiments show V-Stylist achieving state-of-the-art performance, e.g., improvements of 6.05% and 4.51% over FRESCO and ControlVideo respectively in overall metrics. The work demonstrates the practical impact of collaborative MLLM agents and reflection-driven optimization for robust, open-ended video editing, with potential to transform long-form video stylization tools and workflows.

Abstract

Despite recent advances in video stylization, most existing methods struggle to render videos with complex transitions based on an open style description in a user query. To fill this gap, we introduce V-Stylist, a generic multi-agent system for video stylization built on a novel collaboration and reflection paradigm of multi-modal large language models (MLLMs). Specifically, V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates text prompts describing the key content of each shot. Via a concise video-to-shot prompting paradigm, it allows V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches for the matched style model in a style tree. Via a robust tree-of-thought searching paradigm, it allows V-Stylist to precisely pin down vague style preferences in the open user query. (3) Style Artist leverages the matched model to render all the video shots in the required style. Via a novel multi-round self-reflection paradigm, it allows V-Stylist to adaptively adjust detail control according to the style requirement. With this distinct design, which mimics human professionals, V-Stylist achieves a major breakthrough on the primary challenges of effective and automatic video stylization. Moreover, we construct a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), which fills the gap in assessing the stylization of complex videos with open user queries. Extensive experiments show that V-Stylist achieves state-of-the-art performance; e.g., it surpasses FRESCO and ControlVideo by 6.05% and 4.51%, respectively, in overall average metrics, marking a significant advance in video stylization.
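To make the Style Parser's search over a style tree more concrete, here is a minimal Python sketch. It is not the authors' implementation: it uses a greedy, single-path descent rather than a full tree-of-thought search, and a simple keyword-overlap scorer stands in for the MLLM's judgment. The `StyleNode` class, the `search_style_tree` and `keyword_overlap` functions, the example tree, and the model identifiers are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class StyleNode:
    """One node of a hypothetical style tree; only leaves carry a model id."""
    name: str
    model: Optional[str] = None
    children: List["StyleNode"] = field(default_factory=list)


def keyword_overlap(query: str, label: str) -> int:
    """Toy scorer (assumption): count words shared by the query and a label.
    In the described system an MLLM would make this judgment instead."""
    return len(set(query.lower().split()) & set(label.lower().split()))


def search_style_tree(
    root: StyleNode,
    query: str,
    score: Callable[[str, str], int],
) -> Tuple[StyleNode, List[str]]:
    """Greedy descent: at each level, follow the child the scorer ranks highest."""
    node, path = root, [root.name]
    while node.children:
        node = max(node.children, key=lambda child: score(query, child.name))
        path.append(node.name)
    return node, path


# Hypothetical two-level style tree with made-up model identifiers.
tree = StyleNode("all styles", children=[
    StyleNode("painting", children=[
        StyleNode("oil painting", model="oil-v1"),
        StyleNode("watercolor painting", model="watercolor-v1"),
    ]),
    StyleNode("animation", children=[
        StyleNode("anime animation", model="anime-v1"),
        StyleNode("pixel animation", model="pixel-v1"),
    ]),
])

leaf, path = search_style_tree(
    tree, "make it look like a watercolor painting", keyword_overlap
)
print(path, leaf.model)
```

Descending coarse-to-fine keeps each decision small: the scorer only compares sibling styles at one level instead of ranking every leaf model at once. A search closer to tree-of-thought would presumably also keep several candidate branches and allow backtracking, which this sketch omits.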

Paper Structure

This paper contains 16 sections, 4 equations, 20 figures, 4 tables, 1 algorithm.

Figures (20)

  • Figure 1: Overview. V-Stylist is a multi-agent system for complex video stylization with open user query. It contains three key roles to address the primary challenges in video stylization. Video Parser decomposes a video into several shots and prompts, via a video-to-shot prompting paradigm. Style Parser finds the most suitable stylization model, via a tree-of-thought searching paradigm. Style Artist adaptively adjusts detail control for preferable shot rendering, via a multi-round style reflection paradigm.
  • Figure 2: Video Parser. First, Shot Detector splits the video into a number of shots based on transitions. Then, Shot Captioner generates a caption describing the key content of each shot. Finally, Shot Translator converts the shot captions into text prompts for later use in diffusion.
  • Figure 3: Style Parser. First, Style Identifier extracts the style preference from the open user query. Second, Style Tree Builder constructs a style tree based on the dependencies among various styles. Finally, Style Searcher uses tree-of-thought to search for the matched style model.
  • Figure 4: Style Artist. In Style Render, we leverage the matched style model to convert a video shot into the required style, based on the current control weights. In Style Reflection, we use a Style Scorer to evaluate whether the stylized shot is satisfactory. If not, we use a Control Refiner to generate new control weights for stylization in the next round. The two steps alternate, progressively and adaptively enhancing the visual details of the stylization.
  • Figure 5: Qualitative Comparison with State-of-the-Art Methods. Our V-Stylist achieves the best results in terms of condition alignment, temporal consistency, and video quality, outperforming open-source state-of-the-art methods such as ControlVideo, Rerender-A-Video, and FRESCO.
  • ...and 15 more figures
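
The render-and-reflect loop described in the Figure 4 caption can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: `render`, `score`, and `refine` are hypothetical callables standing in for Style Render, Style Scorer, and Control Refiner, and the score threshold, the round limit, and the toy stand-ins at the bottom are invented for the example.

```python
from typing import Any, Callable, Dict, List, Tuple

Weights = Dict[str, float]


def stylize_with_reflection(
    shot: Any,
    render: Callable[[Any, Weights], Any],
    score: Callable[[Any], float],
    refine: Callable[[Weights, float], Weights],
    init_weights: Weights,
    threshold: float = 0.75,   # assumed acceptance score
    max_rounds: int = 4,       # assumed cap on reflection rounds
) -> Tuple[Any, Weights, List[Tuple[Weights, float]]]:
    """Alternate rendering and reflection until the score is acceptable."""
    weights = dict(init_weights)
    history: List[Tuple[Weights, float]] = []
    best_output, best_weights, best_score = None, dict(weights), float("-inf")
    for _ in range(max_rounds):
        output = render(shot, weights)       # Style Render step
        current = score(output)              # Style Scorer judges the result
        history.append((dict(weights), current))
        if current > best_score:
            best_output, best_weights, best_score = output, dict(weights), current
        if current >= threshold:
            break                            # satisfactory: stop reflecting
        weights = refine(weights, current)   # Control Refiner proposes new weights
    return best_output, best_weights, history


# Toy stand-ins (assumptions): the "renderer" echoes its inputs, the "scorer"
# prefers a depth-control weight near 0.6, and the "refiner" lowers every
# control weight by 0.2 per round.
def toy_render(shot: Any, weights: Weights) -> Dict[str, Any]:
    return {"shot": shot, "weights": dict(weights)}


def toy_score(output: Dict[str, Any]) -> float:
    return 1.0 - abs(output["weights"]["depth"] - 0.6)


def toy_refine(weights: Weights, current: float) -> Weights:
    return {name: max(0.0, value - 0.2) for name, value in weights.items()}


result, final_weights, history = stylize_with_reflection(
    "shot-0", toy_render, toy_score, toy_refine, {"depth": 1.0}
)
print(len(history), final_weights)
```

Returning the best-scoring round rather than the last one means the loop still yields a usable result if no round reaches the threshold before `max_rounds` runs out.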