Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

TL;DR

Video-STaR introduces a novel self-training loop that reuses labeled video datasets for visual instruction tuning by generating answers, rationalizing labels when needed, and verifying label containment. By cycling data generation with finetuning and enforcing weak supervision through a Parser-Verifier, it enables incorporating diverse video supervision beyond caption-style prompts. The approach yields strong zero-shot QA gains (e.g., TempCompass) and substantial improvements on adapted tasks (e.g., Kinetics700, FineDiving) and results in the VSTaR-1M dataset, illustrating broad applicability across domains. These results suggest that weakly supervised, cycle-based self-training can significantly enhance LVLMs’ video understanding and cross-domain adaptability, while highlighting areas for further refinement such as computational efficiency and reducing hallucinations.

Abstract

The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity because they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered to only those that contain the original video labels, and the LVLM is then re-trained on the generated dataset. By only training on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) on downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
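The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible check, a case-insensitive substring match between the generated answer and the dataset label; the `Candidate` type and both function names are hypothetical and not taken from the paper, whose Parser-Verifier may handle labels differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated (video, question, answer) triplet plus its dataset label."""
    video_id: str
    question: str
    answer: str
    label: str  # ground-truth label from the source dataset


def contains_label(answer: str, label: str) -> bool:
    # Assumption: label containment is a case-insensitive substring match.
    needle = label.strip().lower()
    # An empty label would match any answer, so reject it explicitly.
    return bool(needle) and needle in answer.lower()


def filter_by_label(candidates: list[Candidate]) -> list[Candidate]:
    # Keep only the answers that mention the original video label.
    return [c for c in candidates if contains_label(c.answer, c.label)]
```

Only the candidates that survive `filter_by_label` would be used for the next round of instruction tuning, which is how the existing labels act as weak supervision.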

Paper Structure (36 sections, 3 equations, 10 figures, 7 tables)

This paper contains 36 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Video-STaR Overview. Video-STaR can utilize any labeled video dataset, including AR (Action Recognition), AQA (Action Quality Assessment), and TAL (Temporal Action Localization) -- from which it generates video instruction tuning data (video, question, answer triplets). Internally, Video-STaR cycles between: (I) Answer Generation, where an LVLM is prompted to generate candidate answers for the questions; (II) Label Verification, where generated answers are filtered to only those that contain the video labels; and (III) Instruction Tuning, where a model is retrained on answers that pass verification. These cycles continue until performance plateaus.
  • Figure 2: Video Self-Training with augmented Reasoning. We initialize by prompting an LVLM to generate an answer for a particular video. We then filter the generated answers to only those containing the original video labels. The videos whose generated answer did not contain the ground-truth labels are then sent to label rationalization, where, given the video, question, and label, the model is expected to rationalize the label. The generated answers are filtered again to only those containing the ground-truth labels, and the LVLM is instruction-tuned from the pre-trained checkpoint on the resulting dataset. The cycle is then repeated.
  • Figure 3: Qualitative Improvement of Data Generation over Cycles on FineDiving. We initialize the model with Video-LLaVA (Cycle 0), where the model cannot generate an answer ($\rightarrow\vert \times$) or rationalize the label correctly ($\vert\rightarrow \times$). In the second cycle (Cycle 1), the model still cannot generate an answer ($\rightarrow\vert \times$) but can rationalize the video label ($\checkmark\vert\rightarrow$), which is selected for instruction tuning. Finally, in the third cycle (Cycle 2), the model directly generates a correct answer ($\checkmark\vert\rightarrow$), which is selected for visual instruction tuning. Correct answers are highlighted in green, wrong answers in red, and hallucinations in yellow.
  • Figure 4: Dataset Yield vs. Cycles. Percentage of the videos converted to instruction tuning data by Answer Generation and Label Rationalization, per dataset. On difficult datasets, such as FineDiving, no videos are converted by Answer Generation in the first cycle. With Label Rationalization, the model improves over cycles until it eventually generates correct answers directly.
  • Figure 5: Action Quality Assessments by Video-STaR on the FineDiving Test Set. Different diving sequences with corresponding Video-STaR evaluations, from a high score of $85.78$ for complex sequences (top) to $74.8$ for intermediate (middle), and a lower score of $54.6$ for basic sequences (bottom), showcasing Video-STaR's proficiency in scoring dives with varying degrees of difficulty and execution quality.
  • ...and 5 more figures
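The cycle described in the captions of Figures 1 and 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `generate`, `rationalize`, `verify`, and `finetune` are hypothetical callables supplied by the caller, and a fixed `max_cycles` stands in for the "until performance plateaus" stopping rule. The per-cycle counts loosely mirror the yield curves of Figure 4.

```python
from typing import Any, Callable, Sequence

Sample = tuple[str, str, str]   # (video, question, label)
Triplet = tuple[str, str, str]  # (video, question, answer)


def run_cycle(
    model: Any,
    pretrained: Any,
    dataset: Sequence[Sample],
    generate: Callable[[Any, str, str], str],
    rationalize: Callable[[Any, str, str, str], str],
    verify: Callable[[str, str], bool],
    finetune: Callable[[Any, list[Triplet]], Any],
) -> tuple[Any, dict[str, int]]:
    """One cycle: generate, verify, rationalize failures, verify again, tune."""
    kept: list[Triplet] = []
    stats = {"generated": 0, "rationalized": 0}
    for video, question, label in dataset:
        answer = generate(model, video, question)
        source = "generated"
        if not verify(answer, label):
            # Fall back to label rationalization: the label is given as a hint.
            answer = rationalize(model, video, question, label)
            source = "rationalized"
            if not verify(answer, label):
                continue  # neither path produced a verified answer
        kept.append((video, question, answer))
        stats[source] += 1
    # Tuning restarts from the pre-trained checkpoint, as in the Figure 2 caption.
    return finetune(pretrained, kept), stats


def video_star(pretrained, dataset, generate, rationalize, verify, finetune,
               max_cycles: int = 3):
    # Assumption: a fixed cycle budget replaces the plateau-based stopping rule.
    model, history = pretrained, []
    for _ in range(max_cycles):
        model, stats = run_cycle(model, pretrained, dataset, generate,
                                 rationalize, verify, finetune)
        history.append(stats)
    return model, history
```

Passing the model-specific steps in as callables keeps the sketch independent of any particular LVLM or training framework; a plateau check could replace the fixed budget by comparing a validation metric between consecutive cycles.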