Unhackable Temporal Rewarding for Scalable Video MLLMs

En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao

TL;DR

This work identifies temporal hacking as a core cause of the anti-scaling law in video-language MLLMs and formalizes it within a reinforcement-learning framework. It introduces Temporal Perplexity (TPL) to quantify the misalignment between proxy and true objectives, and presents Unhackable Temporal Rewarding (UTR), which is grounded in high frame information density and inter-frame dynamics, to reshape reward signals. With Video-UTR, the method achieves improved video understanding using less data and fewer parameters, supported by extensive benchmarks and ablations. The approach highlights the importance of aligning proxy rewards with true video-language objectives and offers scalable strategies for robust temporal modeling in video AI systems.

Abstract

In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models take shortcuts by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but also significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.

Paper Structure

This paper contains 24 sections, 9 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Illustration of temporal hacking. We select a scene from Zootopia to vividly illustrate the phenomenon of temporal hacking, where the fox is named Nick and the rabbit is named Judy. Humans watch videos frame by frame, gradually building an understanding of the content, following a “flow” similar to a Markov process. In contrast, MLLMs process the entire video and its content at once, which can cause them to take shortcuts by focusing only on the most relevant frames.
  • Figure 2: Analysis of temporal hacking. (a) shows the relationship between temporal perplexity and true performance. The radius of each circle represents the amount of data. (b) visualizes the attention map, illustrating which specific frames the model’s output focuses on.
  • Figure 3: Overall pipeline of Unhackable Temporal Rewarding (UTR). UTR begins by using a mixture of expert models to extract unique spatiotemporal attributes and employs a tracking algorithm to construct multiple subject trajectories based on confidence levels (data modeling, top). It then performs bidirectional querying of temporal and spatial attributes to generate dialogue data (task modeling, bottom), thereby learning spatiotemporal dynamics.
  • Figure 4: Zero-shot spatial-temporal understanding performance on MM-ID (ji2024ida).
  • Figure 6: Quantitative comparison of video-text pairs with different temporal perplexity.
  • ...and 5 more figures