ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan

TL;DR

ARC-Hunyuan-Video-7B tackles real-world short-video understanding through structured video comprehension that fuses visual, audio, and textual signals with explicit temporal cues. Built on Hunyuan-7B, it adds an audio encoder and a timestamp overlay, and is trained through a multi-stage regimen: pre-training on data from a bootstrapped annotation pipeline, instruction fine-tuning, cold-start initialization, GRPO-based reinforcement learning, and a final round of instruction fine-tuning. A dedicated benchmark, ShortVid-Bench, together with qualitative and quantitative evaluations, shows strong performance in multi-granular captioning, temporal grounding, and audio-visual reasoning, with gains in downstream applications and efficient inference in real-world deployments. The work also provides open-source checkpoints, APIs, and inference code to accelerate research and practical adoption in structured video understanding for real-world shorts.

Abstract

Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate visual, audio, and textual information. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot use or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.

Paper Structure

This paper contains 48 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Model capabilities of ARC-Hunyuan-Video-7B, which supports multi-granular timestamped captioning (output time span and corresponding description), summarization, temporal grounding, and open-ended question answering through integrating and reasoning over both visual and audio cues in the user-generated short videos.
  • Figure 2: (a) Model architecture. Built upon the Hunyuan-7B VLM, we incorporate an audio encoder with fine-grained visual-audio synchronization to obtain temporally aligned multimodal inputs. Timestamps are overlaid on visual frames to provide the model with temporal awareness. (b) Training stages including pre-training, instruction fine-tuning, cold start initialization, RL post-training, and final instruction fine-tuning using high-quality human-annotated data and trajectories selected via rejection sampling.
  • Figure 3: Our automated bootstrapped annotation pipeline for pre-training. It extracts timestamped speech via an ASR model and frame-level descriptions via an MLLM; these, along with meta information (e.g., title), are input to an LLM for initial video annotation. The annotated data is used to train an initial version of the model, whose inference results are further integrated to produce the final annotations.
  • Figure 4: An example of ARC-Hunyuan-Video-7B. Given an instructional short video, our model can accurately identify and summarize the content of each step along with the corresponding time spans. For specific questions, the model is also able to locate the relevant time segments within the video, thereby providing precise answers.
  • Figure 5: An example of ARC-Hunyuan-Video-7B. Given a real-world video with excellent audiovisual quality, our model can analyze the video from visual, auditory, and thematic perspectives, and through reasoning, provide fine-grained segment recommendations.
  • ...and 7 more figures
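
The timestamp overlay mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It only computes the text label that would be drawn on each sampled frame; the actual rendering step is omitted. The 1 frame-per-second sampling rate, the HH:MM:SS label format, and the function names `format_timestamp` and `frame_overlay_labels` are assumptions made for illustration and are not taken from the paper.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (the format is an assumption)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def frame_overlay_labels(duration_s: float, fps: float = 1.0) -> list[tuple[float, str]]:
    """Return (time_in_seconds, overlay_text) for frames sampled at `fps`.

    The default of 1 frame per second is a hypothetical choice for this sketch.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    n_frames = int(duration_s * fps)
    return [(i / fps, format_timestamp(i / fps)) for i in range(n_frames)]


if __name__ == "__main__":
    # Show the labels for the first three frames of a one-minute video.
    for t, label in frame_overlay_labels(60.0)[:3]:
        print(t, label)
```

In a full pipeline, each label would be drawn onto its frame before the frame reaches the visual encoder, so that temporal position is visible in the pixels themselves.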