SF$^2$T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang
TL;DR
The paper addresses the gap in fine-grained spatiotemporal understanding in Video-LLMs by introducing SF$^2$T, a self-supervised fragment-finetuning framework that trains models on five fragment-level tasks derived from video dynamics without manual annotations. It also introduces FineVidBench, a comprehensive benchmark with scene- and fragment-level evaluations (910 videos and 22,718 QA pairs) to rigorously assess spatiotemporal comprehension under varying speeds and frame sequences. Experimental results show that SF$^2$T broadly improves Video-LLM performance over baselines and complements supervised fine-tuning, particularly enhancing temporal sensitivity and sequence reasoning. The combination of SF$^2$T and FineVidBench offers a scalable path to bolster fine-grained video perception in Video-LLMs and provides a robust evaluation suite for future research.
Abstract
Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at describing a video as a whole, they struggle with fine-grained understanding, particularly with visual dynamics and questions about video details. To address these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding. We therefore make two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel, effortless fine-tuning method that uses the rich inherent characteristics of videos for training and unlocks more of the fine-grained understanding ability of Video-LLMs. It also relieves researchers of labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) FineVidBench, a novel benchmark dataset for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We evaluate multiple models and validate the effectiveness of SF$^2$T on them. Experimental results show that our approach improves their ability to capture and interpret spatiotemporal details.
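To make the idea of annotation-free fragment tasks concrete, here is a minimal sketch of how training labels can be derived from a video's own frame indices. It is an illustration under stated assumptions, not the paper's implementation: the two tasks shown (frame-order check and playback-stride guess), the function names `make_order_sample` and `make_speed_sample`, the question wording, and the default parameters are all hypothetical.

```python
import random


def make_order_sample(num_frames, fragment_len=4, seed=None):
    """Hypothetical task: is a short fragment shown in its original order?

    The label comes from the sampling procedure itself, so no human
    annotation is needed.
    """
    if fragment_len < 2 or fragment_len > num_frames:
        raise ValueError("need 2 <= fragment_len <= num_frames")
    rng = random.Random(seed)
    start = rng.randrange(0, num_frames - fragment_len + 1)
    indices = list(range(start, start + fragment_len))
    shuffled = rng.random() < 0.5
    if shuffled:
        original = indices[:]
        # Reshuffle until the order actually differs from the original.
        while indices == original:
            rng.shuffle(indices)
    return {
        "frame_indices": indices,
        "question": "Are these frames in their original temporal order?",
        "answer": "no" if shuffled else "yes",
    }


def make_speed_sample(num_frames, fragment_len=4, strides=(1, 2, 4), seed=None):
    """Hypothetical task: which sampling stride produced this fragment?"""
    rng = random.Random(seed)
    # Keep only strides whose fragment fits inside the video.
    usable = [s for s in strides if (fragment_len - 1) * s + 1 <= num_frames]
    if not usable:
        raise ValueError("video too short for the requested strides")
    stride = rng.choice(usable)
    span = (fragment_len - 1) * stride + 1
    start = rng.randrange(0, num_frames - span + 1)
    indices = list(range(start, start + span, stride))
    return {
        "frame_indices": indices,
        "question": "How many frames apart were these frames sampled?",
        "answer": stride,
    }


if __name__ == "__main__":
    print(make_order_sample(num_frames=32, seed=0))
    print(make_speed_sample(num_frames=32, seed=1))
```

Each returned dictionary pairs a list of frame indices with a question and an automatically known answer, which is the property that lets such data be generated at scale without manual labeling.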
