Vidi2: Large Multimodal Models for Video Understanding and Creation

Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen, Guang Chen, Haoji Zhang, Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Weiyan Tao, Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin

TL;DR

Vidi2 tackles end-to-end spatio-temporal grounding (STG) for videos and extends multimodal reasoning to Video QA, so that a text query yields both the relevant time ranges and the bounding-box tubes of the target objects. It introduces two benchmarks, VUE-STG and VUE-TR-V2, to evaluate long-form, fine-grained grounding and temporal retrieval under realistic conditions, with manually annotated ground truth and refined metrics such as $vIoU$, $tIoU$, and AUC-based scores. The method upgrades a 12B multimodal backbone with adaptive token compression, augmented training data (including STG- and QA-focused data), and a dedicated STG/data synthesis pipeline, achieving end-to-end localization and reasoning across video modalities. Empirical results show that Vidi2 outperforms leading proprietary systems such as Gemini 3 Pro (Preview) and GPT-5 on the STG and TR benchmarks and remains competitive with open-source baselines on Video QA, which points to practical value for editing, storytelling, and content creation tasks. Together, these contributions position Vidi2 as a foundation for intelligent, composition-aware video understanding and generation systems.

Abstract

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: videos span from roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving results competitive with popular open-source models of similar scale on video QA benchmarks.
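
The abstract names tIoU and vIoU without defining them in this section. As a rough illustration only, the sketch below implements the conventional definitions used in earlier STG work: tIoU is the overlap ratio of predicted and ground-truth time ranges, and vIoU sums per-timestamp box IoU over the temporal intersection and divides by the size of the temporal union. The paper's refined scheme and its vIoU-Intersection variant may differ. The function names, the tube representation (a dict from sampled timestamp to box), and the one-box-per-timestamp sampling are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), e.g. percentage coordinates
Interval = Tuple[float, float]           # (start, end) in seconds


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so lengths are not double counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def temporal_iou(pred: List[Interval], gt: List[Interval]) -> float:
    """Conventional tIoU: |pred ∩ gt| / |pred ∪ gt| over time ranges."""
    pred_m, gt_m = merge_intervals(pred), merge_intervals(gt)
    inter = sum(
        max(0.0, min(pe, ge) - max(ps, gs))
        for ps, pe in pred_m
        for gs, ge in gt_m
    )
    union = sum(e - s for s, e in pred_m) + sum(e - s for s, e in gt_m) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Standard IoU of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred_tube: Dict[int, Box], gt_tube: Dict[int, Box]) -> float:
    """Conventional vIoU: box IoU summed over shared timestamps,
    normalized by the number of timestamps in the temporal union."""
    union_t = set(pred_tube) | set(gt_tube)
    if not union_t:
        return 0.0
    inter_t = set(pred_tube) & set(gt_tube)
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in inter_t) / len(union_t)
```

Under this definition a prediction is penalized twice for temporal error: timestamps outside the intersection contribute nothing to the sum but still enlarge the denominator, so a spatially perfect tube over the wrong time range scores low.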

Paper Structure

This paper contains 33 sections, 9 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Examples of spatio-temporal grounding queries and their corresponding time ranges and object tubes (timestamps with bounding boxes, shown in yellow). Bounding boxes are expressed in percentage coordinates. The example video has a total duration of $387$ seconds (i.e., $\texttt{06:27}$), and the query has been converted into a noun-style format. Facial regions are blurred to protect privacy.
  • Figure 2: The video distribution comparison between the VUE-TR-V2 and VUE-TR benchmarks.
  • Figure 3: The distribution of query modality and format in the VUE-TR-V2 benchmark.
  • Figure 4: Overall performance curves for temporal retrieval on the VUE-TR-V2 benchmark. We report accuracy across varying thresholds for different models.
  • Figure 5: Example of highlight extraction application.
  • ...and 2 more figures
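
Figure 4 reports accuracy across varying thresholds, and the TL;DR mentions AUC-based scores. One common way to build such a curve is sketched below: accuracy at a threshold is the fraction of queries whose IoU score meets it, and the AUC is the trapezoidal area under that curve. This is an assumption about the general technique, not the paper's exact protocol; the threshold grid, the `>=` comparison, and the function names are hypothetical.

```python
from typing import List, Sequence


def accuracy_curve(scores: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Fraction of per-query scores that reach each threshold."""
    if not scores:
        return [0.0 for _ in thresholds]
    return [sum(s >= t for s in scores) / len(scores) for t in thresholds]


def area_under_curve(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Trapezoidal area under the points (xs[i], ys[i]); xs must be ascending."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


# Hypothetical per-query tIoU scores, for illustration only.
example_scores = [0.2, 0.5, 0.9]
example_thresholds = [0.0, 0.5, 1.0]
example_accuracy = accuracy_curve(example_scores, example_thresholds)
example_auc = area_under_curve(example_thresholds, example_accuracy)
```

Because the thresholds span [0, 1], the area is already normalized to [0, 1]; a finer grid gives a smoother curve and a closer approximation of the area.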