LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models

Hongchen Wei, Zhihong Tan, Yaosi Hu, Chang Wen Chen, Zhenzhong Chen

TL;DR

We address the bottleneck in long-video captioning for open-source large multimodal models (LMMs) and identify the scarcity of long-caption training data as the primary limiting factor. To overcome it, we introduce LongCaption-Agent, a hierarchical synthesis framework that generates frame-, clip-, and video-level descriptions using off-the-shelf LMMs and LLMs, and we release LongCaption-10K, a 10k-sample long-caption dataset. To evaluate long captions reliably, we propose LongCaption-Bench, a 281-video benchmark with length, quality, and relevance metrics tailored to long-form outputs. Training a LongCaptioning model on LongCaption-10K, aided by a visual context window extension, yields captions exceeding 1,000 words with high quality and relevance, surpassing some larger proprietary models on LongCaption-Bench. The work provides a cost-effective path to long-caption generation in open-source LMMs and offers a dataset, a benchmark, and a methodology for future research in long-video captioning.

Abstract

Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as the length of the video increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. As a result, captions for long videos miss important visual information. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples for long-form videos is time-consuming and expensive. To overcome the annotation bottleneck, we propose LongCaption-Agent, a framework that synthesizes long-caption data through hierarchical semantic aggregation. Using LongCaption-Agent, we curate a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words for long-form videos while maintaining high output quality. On LongCaption-Bench, our model achieves state-of-the-art performance, even surpassing larger proprietary models like GPT4o.
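
The hierarchical aggregation behind LongCaption-Agent (frame-level, then clip-level, then video-level descriptions) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the names `synthesize_long_caption`, `describe_frame`, `describe_clip`, and `merge` are hypothetical, the fixed-size grouping of frames into clips is an assumption, and passing the previous clip description forward is only one possible reading of "iteratively". The model calls are replaced by stub functions.

```python
from typing import Callable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def synthesize_long_caption(
    frames: List[int],
    describe_frame: Callable[[int], str],
    describe_clip: Callable[[List[str], str], str],
    merge: Callable[[List[str], List[str]], str],
    frames_per_clip: int = 4,  # assumed grouping, not taken from the paper
) -> str:
    # Level 1: one description per sampled frame (an LMM call in practice).
    frame_descs = [describe_frame(f) for f in frames]

    # Level 2: one description per clip, built iteratively so that each
    # clip can see the previous clip's description (an assumption).
    clip_descs: List[str] = []
    for clip in chunk(frame_descs, frames_per_clip):
        previous = clip_descs[-1] if clip_descs else ""
        clip_descs.append(describe_clip(clip, previous))

    # Level 3: merge both levels into one long caption (an LLM call in practice).
    return merge(frame_descs, clip_descs)


# Stub "models" so the sketch runs without any external service.
caption = synthesize_long_caption(
    frames=list(range(10)),
    describe_frame=lambda f: f"frame {f}",
    describe_clip=lambda descs, prev: f"clip of {len(descs)} frames",
    merge=lambda frame_descs, clip_descs: " ".join(clip_descs),
)
print(caption)
```

With ten sampled frames and four frames per clip, the stubs produce three clip-level descriptions before the final merge.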

Paper Structure

This paper contains 25 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: The output length of LMMs as a function of video duration. The maximum output length of open-source LMMs is around 300 words, which is significantly shorter than that of proprietary models.
  • Figure 2: The LongCaption-Agent framework. The framework uses an off-the-shelf LMM (i.e., MiniCPMV2.6-8B [yao2024minicpm]) to first generate frame-level descriptions for each sampled frame, and then iteratively produce clip-level descriptions for each clip. Finally, an off-the-shelf LLM (i.e., GLM4-Long [glm2024chatglm]) is used to integrate the frame-level and clip-level descriptions into a complete long caption for the video, incorporating additional context such as timestamps.
  • Figure 3: Existing Video-Text Datasets
  • Figure 4: Key statistics of LongCaption-10K.
  • Figure 5: The framework of the Large Multimodal Model (LMM). A vision encoder processes video frames and generates vision tokens, which are then passed through a compression layer that reduces their number. The compression layer applies attention and a weighted sum over the vision tokens and a set of learnable query tokens, where the number of learnable queries is much smaller than the number of keys and values. The compressed tokens are fed into a Large Language Model (LLM), which is trained with LoRA (Low-Rank Adaptation).
  • ...and 6 more figures
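
The compression layer described in the Figure 5 caption can be illustrated with a small cross-attention sketch: a few query vectors attend over many vision tokens, and each output is a weighted sum of those tokens. This is a simplified illustration under stated assumptions, not the model's actual layer: the function name `compress_tokens` is hypothetical, the queries are random rather than learned, there are no key/value/query projection matrices, and the sizes (64 tokens, 4 queries, dimension 8) are arbitrary.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(vision_tokens: List[Vector], queries: List[Vector]) -> List[Vector]:
    """Cross-attention pooling: each query attends over all vision tokens.

    vision_tokens act as both keys and values (N vectors).
    queries would be learnable parameters in a real model (M vectors, M << N).
    Returns M vectors, each a weighted sum of the vision tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in vision_tokens]
        weights = softmax(scores)
        # Weighted sum of the values, one output dimension at a time.
        pooled = [sum(w * v[d] for w, v in zip(weights, vision_tokens)) for d in range(dim)]
        compressed.append(pooled)
    return compressed


rng = random.Random(0)
vision_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]  # N = 64
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # M = 4
compressed = compress_tokens(vision_tokens, queries)
print(len(vision_tokens), "->", len(compressed))
```

The number of output tokens equals the number of queries regardless of how many vision tokens come in, which is what lets such a layer shrink the visual input before it reaches the LLM.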