Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran
TL;DR
TemporalVLM addresses the challenge of temporal reasoning in long videos by representing content as time-aware local features extracted from overlapping short-term clips and globally aggregating them with a learnable BiLSTM before interfacing with a vision-language model. The approach yields state-of-the-art results across dense video captioning, temporal grounding, highlight detection, and temporal action segmentation, validated on standard benchmarks and the new IndustryASM dataset. IndustryASM provides 158 hours of factory-floor video with framewise annotations and timestamps to support time-sensitive action studies. By combining a time-aware clip encoder, a Video Q-Former fusion mechanism, BiLSTM-based global reasoning, and instruction-tuned LLMs, TemporalVLM offers improved temporal understanding for long videos and sets a foundation for future recurrent-model enhancements in video LLMs.
Abstract
We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder that maps a long video to time-aware features containing both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with their timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present IndustryASM, a large-scale long video dataset of industrial assembly processes. It consists of videos recorded on factory floors, with actions and timestamps annotated by industrial engineers for time and motion studies and for temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, i.e., dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
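The first step of the encoder, dividing a long video into overlapping short-term clips that carry their own timestamps, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Clip` and `split_into_clips`, and the parameters `clip_len`, `stride`, and `fps`, are hypothetical, and the abstract does not state the clip length or overlap that is actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """A short-term clip with frame bounds and matching timestamps (hypothetical type)."""
    start_frame: int   # inclusive
    end_frame: int     # exclusive
    start_sec: float
    end_sec: float


def split_into_clips(num_frames: int, fps: float, clip_len: int, stride: int) -> List[Clip]:
    """Slide a window of `clip_len` frames over the video in steps of `stride`.

    Consecutive clips overlap whenever stride < clip_len. The values used here
    are illustrative assumptions, not settings taken from the paper.
    """
    if num_frames <= 0 or fps <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, fps, clip_len and stride must be positive")

    clips: List[Clip] = []
    start = 0
    while True:
        # The last clip is truncated at the end of the video.
        end = min(start + clip_len, num_frames)
        clips.append(Clip(start, end, start / fps, end / fps))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    # A 10-frame video at 2 fps, 4-frame clips, 2-frame stride (50% overlap).
    for clip in split_into_clips(num_frames=10, fps=2.0, clip_len=4, stride=2):
        print(clip)
```

In a full pipeline, each clip and its timestamps would then be encoded into local features, and the resulting sequence of per-clip features would be passed to the BiLSTM for global aggregation; those stages are outside the scope of this sketch.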
