LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models

Hongchen Wei; Zhihong Tan; Yaosi Hu; Chang Wen Chen; Zhenzhong Chen

TL;DR

We address the bottleneck in long-video captioning for open-source large multimodal models (LMMs) and identify the scarcity of long-caption training data as the primary limiting factor. To overcome this, we introduce LongCaption-Agent, a hierarchical synthesis framework that generates frame-, clip-, and video-level descriptions using off-the-shelf LMMs and LLMs, and we release LongCaption-10K, a 10k-sample long-caption dataset. To evaluate long captions reliably, we propose LongCaption-Bench, a 281-video benchmark with length, quality, and relevance metrics tailored to long-form outputs. A LongCaptioning model trained on LongCaption-10K, aided by a visual context window extension, produces captions exceeding 1,000 words with high quality and relevance, and surpasses some larger proprietary models on LongCaption-Bench. The work provides a cost-effective path to long-caption generation in open-source LMMs and offers datasets, benchmarks, and methodologies for future research in long-video captioning.

Abstract

Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as video length increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. This limitation prevents LMMs from providing comprehensive, detailed captions for long videos, so important visual information is missed. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples for long-form videos is time-consuming and expensive. To overcome this annotation bottleneck, we propose LongCaption-Agent, a framework that synthesizes long-caption data through hierarchical semantic aggregation. Using LongCaption-Agent, we curate a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words for long-form videos while maintaining high output quality. On LongCaption-Bench, our model achieves state-of-the-art performance, even surpassing larger proprietary models such as GPT-4o.
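
The hierarchical aggregation idea can be illustrated with a minimal Python sketch. The abstract only states that descriptions are produced at the frame, clip, and video levels with off-the-shelf LMMs and LLMs; everything else here is an assumption made for illustration. In particular, the function names (`chunk`, `synthesize_long_caption`), the three callables, the fixed `frames_per_clip` grouping, and the string-based stubs are hypothetical and are not taken from the LongCaption-Agent implementation.

```python
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def synthesize_long_caption(
    frames: Sequence[str],
    describe_frame: Callable[[str], str],
    summarize_clip: Callable[[List[str]], str],
    compose_video: Callable[[List[str]], str],
    frames_per_clip: int = 4,
) -> str:
    """Aggregate frame-level text into clip-level text, then into one caption.

    The three callables stand in for model calls (hypothetical interface):
    an LMM describing a frame, and LLMs merging descriptions at each level.
    """
    # Level 1: one description per sampled frame.
    frame_captions = [describe_frame(frame) for frame in frames]
    # Level 2: merge consecutive frame descriptions into clip descriptions.
    clip_captions = [
        summarize_clip(group) for group in chunk(frame_captions, frames_per_clip)
    ]
    # Level 3: merge all clip descriptions into a single video-level caption.
    return compose_video(clip_captions)


# Stub "models" so the sketch runs without any model dependency.
def stub_describe_frame(frame: str) -> str:
    return f"[{frame}]"


def stub_summarize_clip(captions: List[str]) -> str:
    return " ".join(captions)


def stub_compose_video(clips: List[str]) -> str:
    return " | ".join(clips)


caption = synthesize_long_caption(
    ["f0", "f1", "f2", "f3", "f4"],
    stub_describe_frame,
    stub_summarize_clip,
    stub_compose_video,
    frames_per_clip=2,
)
print(caption)
```

In practice each stub would be replaced by a call to a captioning or language model; passing them in as callables keeps the aggregation logic independent of any particular model API.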
