Addressing the ID-Matching Challenge in Long Video Captioning
Authors
Zhantao Yang, Huangji Wang, Ruili Feng, Han Zhang, Yuting Hu, Shangwen Zhu, Junyan Li, Yu Liu, Fan Cheng
Abstract
Generating captions for long and complex videos is both critical and
challenging, with significant implications for the growing fields of
text-to-video generation and multi-modal understanding. One key challenge in
long video captioning is accurately recognizing the same individuals who appear
in different frames, which we refer to as the ID-Matching problem. Few prior
works have focused on this important issue. Those that have usually suffer
from poor generalization and depend on point-wise matching, which limits
their overall effectiveness. In this paper, unlike previous approaches, we
build upon large vision-language models (LVLMs) to leverage their powerful priors. We aim to unlock the
inherent ID-Matching capabilities within LVLMs themselves to enhance the
ID-Matching performance of captions. Specifically, we first introduce a new
benchmark for assessing the ID-Matching capabilities of video captions. Using
this benchmark, we investigate LVLMs including GPT-4o and find that
ID-Matching performance can be improved in two ways: 1) making better use of
image information and 2) increasing the amount of information in the
descriptions of individuals. Based on these insights, we propose a
novel video captioning method called Recognizing Identities for Captioning
Effectively (RICE). Extensive experiments, including assessments of caption
quality and ID-Matching performance, demonstrate the superiority of our
approach. Notably, when implemented on GPT-4o, RICE improves ID-Matching
precision from 50% to 90% and recall from 15% to 80% compared to the
baseline. RICE makes it possible to continuously track
different individuals in the captions of long videos.
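
The precision and recall figures above depend on how the benchmark decides that a caption has matched an identity correctly, which this abstract does not specify. As a rough illustration only, the sketch below assumes a pairwise definition: two mentions of a person count as "matched" when they carry the same identity label, precision is the share of caption-linked pairs that are truly the same person, and recall is the share of truly identical pairs that the caption links. The function names, the mention keys, and the pairwise definition itself are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_id_pairs(labels):
    """Return the set of mention pairs that share an identity label.

    `labels` maps a mention key (hypothetical, e.g. "f1_a" for person a in
    frame 1) to an identity label.
    """
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def id_matching_precision_recall(predicted, reference):
    """Pairwise precision and recall of identity links (assumed definition)."""
    pred_pairs = same_id_pairs(predicted)
    ref_pairs = same_id_pairs(reference)
    correct = pred_pairs & ref_pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 0.0
    recall = len(correct) / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall


# Toy data: two people across three frames.
reference = {
    "f1_a": "alice", "f2_a": "alice", "f3_a": "alice",
    "f1_b": "bob", "f3_b": "bob",
}
# Identity labels a caption might imply; the label names need not agree
# with the reference, only the grouping matters.
predicted = {
    "f1_a": "P1", "f2_a": "P1", "f3_a": "P2",
    "f1_b": "P3", "f3_b": "P2",
}

precision, recall = id_matching_precision_recall(predicted, reference)
print(precision, recall)
```

In this toy case the caption links two pairs, one of which is correct, and recovers one of the four true pairs. A pairwise measure of this kind penalizes both merging different people into one identity and splitting one person into several, which is why it is used here for illustration.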