TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun

TL;DR

TEMPLE tackles the shortage of temporal supervision in Video LLMs by pairing a Direct Preference Optimization (DPO) framework with an automated, temporality-focused data pipeline. A core innovation is Progressive Pre-SFT Alignment, which combines a curriculum of increasing perturbation difficulty with an alignment stage applied before instruction tuning, so that fine-grained temporal understanding is instilled ahead of broad instruction-following capabilities. The data pipeline automatically selects temporally rich videos, applies targeted perturbations, and generates clean/perturbed response pairs without external LLMs, enabling scalable temporal supervision. Empirical results across multiple benchmarks show consistent gains in both temporal and general video understanding, with strong evidence that self-generated DPO data and pre-SFT temporal alignment yield the biggest benefits.

Abstract

Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm, which collectively result in the absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference LEarning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy that progressively increases perturbation difficulty to maximize data efficiency, and the application of preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
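
For readers unfamiliar with DPO, the preference objective mentioned in the abstract can be illustrated with a minimal sketch of the standard per-pair DPO loss. This is not taken from the TEMPLE implementation: the function name `dpo_loss`, the default `beta=0.1`, and the reading that the response generated from the clean video plays the role of the preferred ("chosen") response while the response generated from the perturbed video is the rejected one are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. In a TEMPLE-like
    setup (assumption), "chosen" would be the response to the clean video
    and "rejected" the response to the perturbed video.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta is a hypothetical default here.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, which is the additional supervision signal that next-token prediction alone does not provide.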

Paper Structure

This paper contains 32 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Example of detailed captioning result of the original Qwen2-VL-7B and our approach.
  • Figure 2: Illustration of our DPO data generation pipeline.
  • Figure 3: Illustration of video perturbation strategies controlled by the difficulty factor $r$.
  • Figure 4: SFT loss and gradient norm. Pre-SFT Alignment leads to lower loss and more stable gradients.
  • Figure 5: Performance comparison between different tuning strategies on VideoMME, MLVU, and Vinoground (Text).
  • ...and 1 more figure