Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
TL;DR
This work identifies temporal hacking as a core cause of the anti-scaling law in video-language MLLMs and formalizes it within a reinforcement-learning framework. It introduces Temporal Perplexity (TPL) to quantify the misalignment between proxy and true objectives, and presents Unhackable Temporal Rewarding (UTR), grounded in high frame information density and inter-frame dynamics, to reshape reward signals. With Video-UTR, the method demonstrates improved video understanding using less data and fewer parameters, supported by extensive benchmarks and ablations. The approach highlights the importance of aligning proxy rewards with true video-language objectives and offers scalable strategies for robust temporal modeling in video AI systems.
Abstract
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models take shortcuts by fixating on select frames and missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking: we define it from a reinforcement learning perspective, introduce the Temporal Perplexity (TPL) score to assess the misalignment between proxy rewards and true objectives, and propose the Unhackable Temporal Rewarding (UTR) framework to mitigate temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but also significantly improves video comprehension. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
