Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation

Enrico Guerriero, Kjersti Engan, Øyvind Meinich-Bache

TL;DR

Problem: fine-grained activity recognition in newborn resuscitation videos is essential for quality improvement but challenging due to subtle cues and privacy concerns. Approach: compare a supervised TimeSformer baseline with local VLMs and LLM-based strategies, including ZSC variants and fine-tuning with classifier heads and LoRA, using 13.26 hours of simulated data. Contributions: zero-shot strategies struggle with hallucinations, while fine-tuning with a classifier head and LoRA achieves a macro F1 of 0.91, surpassing the TimeSformer baseline's 0.70. Significance: demonstrates the viability of privacy-preserving, edge-based VLM/LLM pipelines for clinical video analysis and highlights the need for task-specific fine-tuning.

Abstract

Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
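
The reported scores are macro F1 values, that is, the unweighted mean of the per-class F1 scores. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `macro_f1` and the labels `activity_a`, `activity_b`, and `activity_c` are hypothetical placeholders, since the activity classes are not listed here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only.
y_true = ["activity_a", "activity_a", "activity_b", "activity_c"]
y_pred = ["activity_a", "activity_b", "activity_b", "activity_c"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the mean, a rare activity that is recognized poorly lowers the macro F1 as much as a frequent one would.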

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Examples of video frames from the different activities.
  • Figure 2: ZSC-J: VLM captioning + LLM judge. The VLM is frozen; the prompt and the judge are tuned.
  • Figure 3: Architecture of the Fine-Tuned Model: the fused representation of Prompt and Video is passed through the Classification Head. FT-LC has a trainable Classifier Head; FT-C-LoRA additionally has trainable parameters within the Cross-Modal Attention block.
  • Figure 4: Trainable Parameters in the FT-LC model (a) and in the FT-C-LoRA model (a and b).
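
For context on the trainable parameters in Figures 3 and 4, the standard LoRA formulation is summarized below. This is the general technique, not a description of the authors' exact configuration: the rank r, the scaling factor α, and which weight matrices are adapted are not stated in this summary, and the symbols W_0, A, B, d, and k are notation introduced here.

```latex
% Frozen pretrained weight W_0, trainable low-rank factors B and A
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters per adapted matrix, versus full fine-tuning
\underbrace{r\,(d + k)}_{\text{LoRA}} \;\ll\; \underbrace{d\,k}_{\text{full}}
```

Only A and B receive gradient updates, which is why the adapted model adds few trainable parameters beyond the classifier head.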