UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei

TL;DR

UniVA presents an open-source, end-to-end video generalist that unifies understanding, editing, segmentation, and generation within a Plan-Act dual-agent framework. By coupling a Planner with an Actor through the MCP-based tool network and a hierarchical memory system, UniVA enables proactive, long-horizon video workflows with strong traceability. The contributions include the UniVA platform, a modular production engine, and UniVA-Bench for multi-step evaluation, all designed to scale through plug-and-play tool integration. Experimental results on UniVA-Bench demonstrate both breadth and depth, revealing agentic synergy that outperforms isolated end-to-end models on multiple tasks and metrics. This work significantly advances interactive, general-purpose video intelligence with practical impact for real-world media production and multimodal AI systems.

Abstract

While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents carry out these steps through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
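
The Plan-and-Act loop described in the abstract can be illustrated with a minimal Python sketch. It is not UniVA's implementation: the `Memory`, `Step`, `plan`, and `act` names, the `$prev` placeholder, and the three lambda tools are all hypothetical, and plain function calls stand in for what the paper describes as MCP-based tool servers. The sketch only shows the control flow: a planner reads user preferences from memory and emits ordered steps, and an actor runs each step, passes its output forward, and records a trace in task-level memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Three hypothetical memory tiers, named after the abstract."""
    global_knowledge: Dict[str, str] = field(default_factory=dict)
    task_context: List[str] = field(default_factory=list)
    user_preferences: Dict[str, str] = field(default_factory=dict)


@dataclass
class Step:
    tool: str      # name of the tool to call
    argument: str  # single string argument; "$prev" means "previous output"


# Stand-ins for tool servers (assumption: a real system would call them over MCP).
TOOLS: Dict[str, Callable[[str], str]] = {
    "generate": lambda prompt: f"video({prompt})",
    "edit": lambda clip: f"edited({clip})",
    "segment": lambda clip: f"mask({clip})",
}


def plan(request: str, memory: Memory) -> List[Step]:
    """Toy planner: always emits a fixed generate -> edit -> segment plan."""
    style = memory.user_preferences.get("style", "default")
    return [
        Step("generate", f"{request}, style={style}"),
        Step("edit", "$prev"),
        Step("segment", "$prev"),
    ]


def act(steps: List[Step], memory: Memory) -> str:
    """Toy actor: run each step in order, feeding the previous output forward."""
    previous = ""
    for step in steps:
        argument = previous if step.argument == "$prev" else step.argument
        previous = TOOLS[step.tool](argument)
        # Append a trace entry so every intermediate result stays inspectable.
        memory.task_context.append(f"{step.tool} -> {previous}")
    return previous


memory = Memory(user_preferences={"style": "ink-painting"})
result = act(plan("a crane over a lake", memory), memory)
print(result)
```

In this sketch the trace list plays the role of task context, which is one simple way to obtain the traceability the abstract mentions; a real planner would presumably produce the plan with a language model rather than a fixed template.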

Paper Structure

This paper contains 73 sections, 4 equations, 24 figures, 4 tables.

Figures (24)

  • Figure 1: UniVA (Universal Video Agent) delivers a highly automated, interactive, and proactive video creation experience, featuring multi-round dialogue co-creation, memory-based contextual reasoning, intent understanding, and tool-augmented planning for iterative user interaction. It also serves as an omnipotent, unified, industrial-grade video production engine, integrating diverse generation, editing, and understanding modules within an MCP-based framework to ensure cinematic quality, consistency, and extensibility across any-conditioned video tasks.
  • Figure 2: Overall architecture of the proposed UniVA system, built on a Plan–Act paradigm. The Plan Agent decomposes user input (text, image, or video) into subtasks by leveraging global memory (historical traces) and user memory (stored materials). The Act Agent retrieves task-specific memory, executes subtasks via the MCP protocol, and coordinates with external MCP servers (video, AI, and non-AI tools). The system generates multimodal outputs, including text, image, video, and audio.
  • Figure 3: Memory-augmented framework for video generation. Global and user memories provide context to the plan agent, while task memory coordinates tool calling, storyboard creation, and video generation.
  • Figure 4: Iterative tool calling for video generation in UniVA. Left: one-prompt task applies a global ink-painting style. Right: multi-round task incrementally edits via segmentation, background change, and extension, demonstrating representative functions.
  • Figure 5: The interface combines a traditional non-linear timeline and preview canvas with a conversational assistant (left), which provides a user-friendly entry point to the UniVA agent. This design supports both one-stop, prompt-based generation and multi-turn, interactive editing workflows.
  • ...and 19 more figures