Video LLMs for Temporal Reasoning in Long Videos

Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran

TL;DR

TemporalVLM addresses the challenge of temporal reasoning in long videos by representing content as time-aware local features extracted from overlapping short-term clips and globally aggregating them with a learnable BiLSTM before interfacing with a vision-language model. The approach yields state-of-the-art results across dense video captioning, temporal grounding, highlight detection, and temporal action segmentation, validated on standard benchmarks and the new IndustryASM dataset. IndustryASM provides 158 hours of factory-floor video with framewise annotations and timestamps to support time-sensitive action studies. By combining a time-aware clip encoder, a Video Q-Former fusion mechanism, BiLSTM-based global reasoning, and instruction-tuned LLMs, TemporalVLM offers improved temporal understanding for long videos and sets a foundation for future recurrent-model enhancements in video LLMs.

Abstract

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder that maps a long-term video into features that are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industrial assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
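The first step described above, dividing a long video into overlapping short-term clips that carry their own timestamps, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Clip` type, the `split_into_clips` function, and the `clip_len` and `stride` parameters are hypothetical names chosen for illustration, and the abstract does not state the actual clip length or overlap used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """A short-term clip with frame bounds and matching timestamps (hypothetical type)."""
    start_frame: int   # inclusive
    end_frame: int     # exclusive
    start_sec: float
    end_sec: float


def split_into_clips(num_frames: int, fps: float, clip_len: int, stride: int) -> List[Clip]:
    """Split a video into clips of up to `clip_len` frames, advancing by `stride` frames.

    Consecutive clips overlap whenever stride < clip_len. The last clip is
    truncated at the end of the video. All parameters are illustrative
    assumptions, not values taken from the paper.
    """
    if num_frames <= 0 or fps <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, fps, clip_len, and stride must be positive")

    clips: List[Clip] = []
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        # Attach timestamps so downstream features can be made time-aware.
        clips.append(Clip(start, end, start / fps, end / fps))
        if end == num_frames:
            break  # the video is fully covered
        start += stride
    return clips


if __name__ == "__main__":
    # A 10-frame video at 2 fps, 4-frame clips, 2-frame stride (50% overlap).
    for clip in split_into_clips(num_frames=10, fps=2.0, clip_len=4, stride=2):
        print(clip)
```

In a full pipeline along the lines of the abstract, each clip and its timestamps would be encoded into a local feature, and the resulting sequence would then be passed to a bidirectional recurrent module for global aggregation; those stages are omitted here.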

Paper Structure

This paper contains 30 sections, 4 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Video LLMs are often not time-sensitive (a, b), consider an input video as a single clip (a, c), and apply pooling (a, b) or query aggregation (c) for aggregating global features. Our model (d) includes a time-aware clip encoder for extracting time-aware fine-grained cues and a BiLSTM for capturing long-range temporal dependencies.
  • Figure 2: TemporalVLM includes two novel components: a time-aware clip encoder for extracting time-aware fine-grained cues and a BiLSTM module for capturing long-range temporal dependencies.
  • Figure 3: Example IndustryASM videos with different camera viewpoints, actors, backgrounds, and activities.
  • Figure 4: Dense video captioning in zero-shot setting on YouCook2. Red denotes inaccuracies.
  • Figure 5: Temporal action segmentation in supervised setting on IndustryASM. Black denotes background.
  • ...and 8 more figures