Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu

Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.

Paper Structure (45 sections, 28 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Overview of the temporal trap and our study. (A) Illustration of the temporal trap: Video-SFT improves video performance but can weaken spatial capability on static images. (B) Image and video benchmarks used in our experiments. (C) Main study dimensions: architecture, scale, frame budget, and task. (D) Hybrid-Frame strategy, which adaptively assigns a suitable frame budget to each training sample and partially mitigates the image-video trade-off.
  • Figure 2: Comparison of image and video benchmark performance before and after Video-SFT across different MLLMs. $\blacktriangledown$ denotes performance degradation after SFT, whereas $\blacktriangle$ indicates performance improvement.
  • Figure 3: Cross-scale attention visualizations before and after Video-SFT on Qwen2.5-VL models (7B, 32B, 72B). For the query “Is there a bird in the image?”, attention becomes more dispersed in smaller models after Video-SFT, while larger models retain more localized focus on the target object, suggesting improved robustness to the temporal trap.
  • Figure 4: Comparison of image and video benchmark performance before and after Video-SFT across Qwen2.5-VL models of different scales (3B, 7B, 32B, and 72B parameters).
  • Figure 5: Comparison of image and video benchmark performance before and after Video-SFT on the Qwen2.5-VL-7B model with 8, 16, 32, and 64 training frames.
  • ...and 3 more figures