Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yolo Y. Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

TL;DR

The paper surveys post-training methodologies for Video-LMMs, identifying three core pillars: Supervised Fine-Tuning (SFT) with chain-of-thought, Reinforcement Learning (RL) with verifiable objectives, and Test-Time Scaling (TTS) for enhanced inference. It provides a structured taxonomy linking modality integration, domain adaptation, CoT-based reasoning, and long-video considerations to RL design (PPO, DPO, GRPO) and reward architectures that emphasize temporal and spatial grounding. The survey aggregates data resources, benchmarks, and evaluation protocols, highlighting data scarcity, reward design challenges, and the need for standardized, verifier-ready assessment across long-form and streaming video tasks. It emphasizes practical guidelines for building robust, efficient Video-LMMs, with future directions focused on structured grounding, verifier-in-the-loop synthesis, tool-augmented inference, and scalable evaluation. Overall, the survey aims to unify best practices for advancing Video-LMM reasoning from post-training to deployment, addressing efficiency, fidelity, and scalability concerns in real-world settings.

Abstract

Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long-video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

Paper Structure

This paper contains 72 sections, 16 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of Video-LMM post-training and the scope of this survey.
  • Figure 2: Research trends in Video-LMM post-training (November 2024 - September 2025). The word cloud is based on the titles of the papers.
  • Figure 3: Taxonomy of Video-LMM post-training.