RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Meilong Xu, Di Fu, Jiaxing Zhang, Gong Yu, Jiayu Zheng, Xiaoling Hu, Dongdi Zhao, Feiyang Li, Chao Chen, Yong Cao

TL;DR

The paper tackles domain adaptation for video classification with large vision-language models (LVLMs) under limited labeled data, identifying a rationale gap between general pretraining and domain-specific semantics. It proposes Rationale-Bootstrapped Fine-Tuning (RB-FT), a two-stage approach: Stage I generates detailed rationales $r_i$ for each video using a structured prompt, and Stage II fine-tunes on ground-truth labels starting from the rationale-aligned model $M_{inter}$. Empirical results on SmartHome-LLM and MultiHateClip show robust gains over direct SFT and zero-shot baselines, with notable improvements on underrepresented classes and more grounded attention representations. The work demonstrates annotation-efficient domain adaptation and enhances interpretability through rationale-grounded reasoning and focused attention.

Abstract

Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical rationale gap, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
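The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' implementation: the `Example` record, the `run_rb_ft` function, and the `generate` / `finetune` callables are hypothetical names introduced here, and the stubs stand in for a real VLM and a real SFT routine.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Example:
    """One supervised example: a video, a prompt, and the text to predict."""
    video_id: str
    prompt: str
    target: str


def run_rb_ft(
    base_model: Any,
    labeled_videos: List[Tuple[str, str]],
    rationale_prompt: str,
    label_prompt: str,
    generate: Callable[[Any, str, str], str],
    finetune: Callable[[Any, List[Example]], Any],
):
    """Hypothetical driver: rationale SFT first, then label SFT."""
    # Offline data construction: the pre-trained model writes a rationale per video.
    stage1_data = [
        Example(vid, rationale_prompt, generate(base_model, vid, rationale_prompt))
        for vid, _ in labeled_videos
    ]
    # Stage I: fine-tune on the self-generated rationales to get M_inter.
    m_inter = finetune(base_model, stage1_data)
    # Stage II: fine-tune M_inter on the ground-truth labels.
    stage2_data = [Example(vid, label_prompt, label) for vid, label in labeled_videos]
    m_final = finetune(m_inter, stage2_data)
    return m_inter, m_final, stage1_data, stage2_data


# Stubs (assumptions) so the sketch runs without a real model.
def fake_generate(model: Any, video_id: str, prompt: str) -> str:
    return f"rationale for {video_id}"


def fake_finetune(model: Any, examples: List[Example]) -> str:
    return f"{model}->sft({len(examples)})"


if __name__ == "__main__":
    videos = [("v1", "abnormal"), ("v2", "normal")]
    m_inter, m_final, _, _ = run_rb_ft(
        "M_theta", videos, "P_rationale", "P_label", fake_generate, fake_finetune
    )
    print(m_inter, m_final)
```

The point of the sketch is the ordering: Stage II starts from the Stage I output rather than from the base model, which is what distinguishes the approach from direct SFT on labels.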

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed Rationale-Bootstrapped Fine-Tuning Framework. The pipeline consists of three phases: (Top) Offline Data Construction: A pre-trained VLM $M_\theta$ is leveraged to generate detailed textual rationales $r_i$ for training videos using a structured prompt $P_{rationale}$. This prompt conditions the model to adopt a specific expert persona (e.g., Smart Home Security Expert) and analyze the video across four semantic dimensions: subjects, attributes, actions, and scenes. (Bottom Left) Stage-I (Rationale-Enhanced Self-Improvement): The model is supervised fine-tuned to generate these domain-specific rationales, producing an intermediate model $M_{inter}$ with enhanced reasoning capabilities. (Bottom Right) Stage-II (Task-specific Label Alignment): The model undergoes a second stage of fine-tuning to predict the final ground-truth labels (e.g., <abnormal>), yielding the final optimized model $M^*_\theta$.
  • Figure 2: Comparative visualization of attention maps between the baseline Direct-SFT model (a) and our proposed RB-FT model (b). The heatmaps (purple to yellow) represent attention intensity, scaled from 0 to 1. The RB-FT model (b) demonstrates significantly improved focal accuracy, concentrating high-intensity attention on the salient subjects (the black bear and the trash can), whereas the Direct-SFT model (a) exhibits diffuse attention, failing to localize the critical regions of interest.
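The structured prompt $P_{rationale}$ from Figure 1 combines an expert persona with four semantic dimensions. A prompt of that shape could be assembled along these lines. The exact wording of the authors' prompt is not given here, so every sentence in the template, along with the `build_rationale_prompt` name, is an assumption made for illustration.

```python
from typing import Sequence

# The four dimensions named in the Figure 1 caption.
DIMENSIONS = ("subjects", "attributes", "actions", "scenes")


def build_rationale_prompt(persona: str, dimensions: Sequence[str] = DIMENSIONS) -> str:
    """Assemble a persona-conditioned prompt (hypothetical wording)."""
    lines = [
        f"You are a {persona}.",
        "Analyze the video along each of the following dimensions:",
    ]
    # One numbered line per semantic dimension.
    lines += [f"{i}. {dim.capitalize()}" for i, dim in enumerate(dimensions, start=1)]
    lines.append("Explain your reasoning in detail for each dimension.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rationale_prompt("Smart Home Security Expert"))
```

Keeping the persona as a parameter lets the same template serve a different domain by swapping only the persona string.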