SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang

TL;DR

The paper addresses the gap in fine-grained spatiotemporal understanding in Video-LLMs by introducing SF$^2$T, a self-supervised fragment-finetuning framework that trains models on five fragment-level tasks derived from video dynamics without manual annotations. It also introduces FineVidBench, a comprehensive benchmark with scene- and fragment-level evaluations (910 videos and 22,718 QA pairs) to rigorously assess spatiotemporal comprehension under varying speeds and frame sequences. Experimental results show that SF$^2$T broadly improves Video-LLM performance over baselines and complements supervised fine-tuning, particularly enhancing temporal sensitivity and sequence reasoning. The combination of SF$^2$T and FineVidBench offers a scalable path to bolster fine-grained video perception in Video-LLMs and provides a robust evaluation suite for future research.

Abstract

Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at describing a video as a whole, they struggle with fine-grained understanding, particularly with visual dynamics and inquiries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel, effortless fine-tuning method that employs the rich inherent characteristics of videos for training while unlocking more fine-grained understanding ability in Video-LLMs. Moreover, it relieves researchers of labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) a novel benchmark dataset, FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.

Paper Structure

This paper contains 18 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Performance w/ and w/o SF$^2$T. We evaluated four advanced Video-LLMs w/ and w/o SF$^2$T on our proposed FineVidBench with two baselines: (1) Base: performance without any fine-tuning (blue dashed), and (2) Base (SFT): performance with supervised fine-tuning (red dashed). After applying SF$^2$T, all models showed significant improvements (solid blue and red), underscoring its broad effectiveness.
  • Figure 1: Four exemplary visualizations of the attention map on Qwen2-VL. For each example: top - Original frames; middle - Base (SFT); bottom - SF$^2$T applied. As highlighted by the red boxes, applying SF$^2$T enables the model to better focus on action execution areas and interacting objects, while also predicting the direction of motion.
  • Figure 2: We show the action semantics and their respective proportions in FineVidBench. Distinctive Action: easily recognizable actions. Non-typical Action: flexible actions with no clear characteristics, like "put" and "move." Slight Movement: subtle actions, such as "hold" and "show," difficult to detect with the naked eye.
  • Figure 3: FineVidBench evaluates videos augmented with speed variations and fragments. Scene-level tests include the following: Action: Tests recognition accuracy amidst distractors like "Visual Synonyms". Effect: Assesses the model’s ability to identify pre- and post-action changes. Speed: Measures the model’s sensitivity to changes in video speed. Fragment-level tests, employing a step-by-step inquiry framework, focus on challenges such as Frame Count, Meaning of Order, Frame Comparison, Adjust-or-Not and Rearrangement.
  • Figure 4: Accuracy across different video speeds. All models are more sensitive to slow-speed videos and struggle to understand "normal speed" and "no speed", except for VideoLLaMA 2.1.
  • ...and 2 more figures
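
The fragment-level tasks named in the Figure 3 caption (Frame Count, Adjust-or-Not, Rearrangement) can be illustrated with a minimal Python sketch of how question-answer pairs might be produced from frame indices alone, with no manual labels. This is not the authors' implementation: the `FragmentSample` class, the function names, the question wording, and the answer format are all hypothetical, and the sketch works on frame indices rather than decoded frames.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FragmentSample:
    """One self-supervised QA item (hypothetical structure, for illustration)."""
    task: str
    frame_indices: List[int]  # frame indices in the order shown to the model
    question: str
    answer: str


def sample_fragment(num_video_frames: int, k: int, rng: random.Random) -> List[int]:
    """Pick k distinct frame indices, returned in temporal order."""
    return sorted(rng.sample(range(num_video_frames), k))


def shuffle_until_changed(frames: List[int], rng: random.Random) -> List[int]:
    """Return a permutation of frames that differs from the input order."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to change the order")
    shown = frames[:]
    while shown == frames:
        rng.shuffle(shown)
    return shown


def frame_count_sample(num_video_frames: int, rng: random.Random,
                       min_k: int = 3, max_k: int = 6) -> FragmentSample:
    """Frame Count: the label is simply how many frames were sampled."""
    k = rng.randint(min_k, max_k)
    frames = sample_fragment(num_video_frames, k, rng)
    return FragmentSample("frame_count", frames,
                          "How many frames are shown?", str(k))


def adjust_or_not_sample(num_video_frames: int, k: int,
                         rng: random.Random) -> FragmentSample:
    """Adjust-or-Not: the label is whether we shuffled the frames ourselves."""
    frames = sample_fragment(num_video_frames, k, rng)
    shuffled = rng.random() < 0.5
    shown = shuffle_until_changed(frames, rng) if shuffled else frames
    return FragmentSample("adjust_or_not", shown,
                          "Do these frames need to be reordered?",
                          "yes" if shuffled else "no")


def rearrangement_sample(num_video_frames: int, k: int,
                         rng: random.Random) -> FragmentSample:
    """Rearrangement: the label lists shown positions (1-based) in temporal order."""
    frames = sample_fragment(num_video_frames, k, rng)
    shown = shuffle_until_changed(frames, rng)
    order = sorted(range(k), key=lambda i: shown[i])
    answer = " ".join(str(i + 1) for i in order)
    return FragmentSample("rearrangement", shown,
                          "List the frame positions in their correct order.",
                          answer)
```

In every case the answer comes from the sampling procedure itself, which is the sense in which such tasks need no annotation. A real pipeline would also have to decode the chosen frames and format them for the specific Video-LLM being fine-tuned.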