Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture

Cheng Li, Jiexiong Liu, Yixuan Chen, Yanqin Jia

TL;DR

The paper tackles inference inefficiency and multimodal data-processing challenges in video-language pretraining by introducing KunLunBaize-VoT-R1, a framework that integrates a long-sequence image encoder with image packing, an Autonomy-of-Experts (AoE) routing scheme, and a Video-of-Thought (VoT) reasoning pipeline. It employs a hybrid attention mechanism to reduce computational complexity, AoE-based adaptive feature fusion, dense learnable residual connections for richer representations, and size-aware embeddings to handle heterogeneous inputs, all trained with distillation and contrastive objectives alongside multi-stage reinforcement learning. The VoT framework couples pixel-level grounding via STSG with a multimodal LLM to perform multi-step reasoning, causal inference, and verification, guided by a curriculum of adapters, LoRA-tuned LLMs, and human-preference rewards. Reported results show state-of-the-art or strong zero-shot performance on diverse video QA benchmarks and robustness on long videos, and the ablation analyses indicate that each component contributes to peak performance. The work offers a practical path toward efficient, scalable video-language understanding, with implications for video QA, temporal localization, and multimodal reasoning systems.

Abstract

In video-language pretraining, existing models face numerous challenges in inference efficiency and multimodal data processing. This paper proposes KunLunBaize-VoT-R1, a video inference model based on a long-sequence image encoder, along with its training and application methods. The model integrates image packing technology and the Autonomy-of-Experts (AoE) architecture, and combines them with Video-of-Thought (VoT) reasoning, a large language model (LLM) trained with large-scale reinforcement learning, and multiple training techniques, which together improve both the efficiency and the accuracy of video inference. Experiments show that the model performs strongly across multiple benchmarks, providing a new solution for video-language understanding.
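The abstract names image packing but this page does not describe how it is done. As a rough illustration only, the general idea behind sample packing can be sketched as follows: variable-length token sequences are placed into fixed-capacity buffers, and a block-diagonal attention mask keeps tokens from different samples from attending to each other. The function names, the greedy first-fit strategy, and the mask layout are all assumptions made for this sketch, not the authors' implementation.

```python
from typing import List


def pack_sequences(lengths: List[int], capacity: int) -> List[List[int]]:
    """Greedy first-fit packing (hypothetical helper).

    Returns a list of bins; each bin holds the indices of the samples
    whose token sequences share one buffer of size `capacity`.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for idx, n in enumerate(lengths):
        if n > capacity:
            raise ValueError(f"sample {idx} has {n} tokens, capacity is {capacity}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([idx])
            free.append(capacity - n)
    return bins


def block_diagonal_mask(seg_lengths: List[int], capacity: int) -> List[List[bool]]:
    """Attention mask for one packed buffer (hypothetical helper).

    mask[i][j] is True only when tokens i and j belong to the same sample.
    Padding positions (segment id -1) attend to nothing.
    """
    if sum(seg_lengths) > capacity:
        raise ValueError("segments do not fit in the buffer")
    seg_id: List[int] = []
    for s, n in enumerate(seg_lengths):
        seg_id.extend([s] * n)
    seg_id.extend([-1] * (capacity - len(seg_id)))
    return [[a == b and a != -1 for b in seg_id] for a in seg_id]


if __name__ == "__main__":
    # Four images with 4, 3, 2 and 5 patch tokens, buffers of 6 tokens.
    print(pack_sequences([4, 3, 2, 5], capacity=6))
    for row in block_diagonal_mask([2, 1], capacity=4):
        print(row)
```

Packing of this kind reduces the compute spent on padding when inputs vary widely in size, which matches the efficiency motivation stated in the abstract; the paper itself should be consulted for the actual scheme.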

Paper Structure

This paper contains 31 sections, 20 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Overall architecture of the proposed encoder, showing how its key components fit together: the hybrid attention mechanism, the Autonomy-of-Experts (AoE) model, dense learnable residual connections, and sample packing.
  • Figure 2: VoT reasoning framework