Harnessing LLMs for Automated Video Content Analysis: An Exploratory Workflow of Short Videos on Depression

Jiaying Lizzy Liu, Yunlong Wang, Yao Lyu, Yiheng Su, Shuo Niu, Xuhai Orson Xu, Yan Zhang

TL;DR

This work addresses the challenge of applying LLMs to multimodal video content analysis by introducing a four-step workflow (codebook design, prompt engineering, LLM processing, and human evaluation) to analyze depression-related short videos. It emphasizes structured annotations and explainable rationales, using keyframes and transcripts to evaluate LLM capabilities against human coders. The study finds that LLMs reliably annotate observable objects and actions but struggle with abstract notions like emotion and genre, highlighting both the potential and the limitations of current LLMs in video analysis. The findings inform future directions for workflow improvements, multimodal integration, and ethical guidelines to enable scalable, transparent analysis of video data while mitigating risks.

Abstract

Despite the growing interest in leveraging Large Language Models (LLMs) for content analysis, current studies have primarily focused on text-based content. In the present work, we explored the potential of LLMs in assisting video content analysis by conducting a case study that followed a new workflow of LLM-assisted multimodal content analysis. The workflow encompasses codebook design, prompt engineering, LLM processing, and human evaluation. We strategically crafted annotation prompts to obtain LLM Annotations in structured form and explanation prompts to generate LLM Explanations, which make the LLM's reasoning more transparent and easier to understand. To test the LLM's video annotation capabilities, we analyzed 203 keyframes extracted from 25 YouTube short videos about depression. We compared the LLM Annotations with those of two human coders and found that the LLM was more accurate in object and activity Annotations than in emotion and genre Annotations. Moreover, we identified the potential and limitations of the LLM's capabilities in annotating videos. Based on the findings, we explore opportunities and challenges for future research and improvements to the workflow. We also discuss ethical concerns surrounding future studies based on LLM-assisted video analysis.
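
The comparison between LLM Annotations and human coding can be illustrated with a minimal sketch. It assumes the two coders' labels have already been reconciled into one reference label per keyframe and category, and that accuracy means exact label match; the abstract does not state the metric used, so this is an illustration rather than the authors' procedure. The function name `accuracy_by_category`, the frame identifiers, and all labels below are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(llm_annotations, human_annotations):
    """Share of keyframes, per category, where the LLM label equals the human label.

    Both arguments map (frame_id, category) -> label. Keys missing from the
    LLM output are skipped. Matching is case-insensitive (an assumption).
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for key, human_label in human_annotations.items():
        if key not in llm_annotations:
            continue
        _, category = key
        totals[category] += 1
        if llm_annotations[key].strip().lower() == human_label.strip().lower():
            matches[category] += 1
    return {category: matches[category] / totals[category] for category in totals}


# Made-up labels for two keyframes; not data from the study.
human = {
    ("f1", "object"): "phone",
    ("f1", "emotion"): "sad",
    ("f2", "object"): "bed",
    ("f2", "emotion"): "neutral",
}
llm = {
    ("f1", "object"): "Phone",
    ("f1", "emotion"): "neutral",
    ("f2", "object"): "bed",
    ("f2", "emotion"): "neutral",
}

scores = accuracy_by_category(llm, human)
print(scores)
```

Reporting accuracy per category rather than as one pooled number keeps the contrast between concrete codes (object, activity) and abstract codes (emotion, genre) visible.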

Paper Structure (19 sections, 2 figures, 1 table)