Can Large Language Models Grasp Concepts in Visual Content? A Case Study on YouTube Shorts about Depression

Jiaying "Lizzy" Liu, Yiheng Su, Praneel Seth

TL;DR

This study investigates whether multimodal large language models can grasp abstract visual concepts in video content, focusing on depression-related YouTube Shorts. It uses four prompting strategies with LLaVA-1.6 Mistral 7B to annotate 725 keyframes and compares AI outputs to human annotations, revealing that alignment is highly sensitive to how concepts are operationalized, the intrinsic complexity of concepts, and the diversity of video genres. Increased prompt detail does not guarantee better alignment, highlighting the trade-off between guidance and flexibility in multimodal interpretation. The work emphasizes the potential of AI to scale visual content analysis while underscoring the need for human-centered auditing, integration of temporal context, and careful prompt design to maintain reliable interpretations.

Abstract

Large language models (LLMs) are increasingly used to assist computational social science research. While prior efforts have focused on text, the potential of leveraging multimodal LLMs (MLLMs) for online video studies remains underexplored. We conduct one of the first case studies on MLLM-assisted video content analysis, comparing AI's interpretations to human understanding of abstract concepts. We leverage LLaVA-1.6 Mistral 7B to interpret four abstract concepts regarding video-mediated self-disclosure, analyzing 725 keyframes from 142 depression-related YouTube short videos. We perform a qualitative analysis of the MLLM's self-generated explanations and find that the degree of operationalization can influence the MLLM's interpretations. Interestingly, greater detail does not necessarily increase human-AI alignment. We also identify other factors affecting AI alignment with human understanding, such as concept complexity and versatility of video genres. Our exploratory study highlights the need to customize prompts for specific concepts and calls for researchers to incorporate more human-centered evaluations when working with AI systems in a multimodal context.

Paper Structure

This paper contains 23 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Examples of human interpretations of the four selected concepts. We annotate Yes/No for presenting and interacting, High/Low for diversity and arousal. We then compare human interpretations with the MLLM interpretations to evaluate human-AI alignment.
  • Figure 2: Distribution of bootstrap alignment scores across prompt configurations and concepts. The MLLM demonstrates varying capabilities: no single prompt configuration consistently achieves the highest alignment across all concepts.
  • Figure 3: Problematic MLLM Annotations.
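
Figure 2 refers to a distribution of bootstrap alignment scores, but this listing does not define the score itself. As a rough illustration only, the sketch below assumes that alignment is simple percent agreement between paired human and MLLM labels for one concept, and that the bootstrap resamples keyframes with replacement. The function name `bootstrap_alignment`, the toy labels, and the choice of 1000 resamples are all hypothetical and are not taken from the paper.

```python
import random


def bootstrap_alignment(human, model, n_boot=1000, seed=0):
    """Return n_boot agreement rates from resampled (human, model) label pairs.

    Assumption: "alignment" here is plain percent agreement, which may differ
    from the metric actually used in the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("human and model must be non-empty and equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(human)
    scores = []
    for _ in range(n_boot):
        # Draw n keyframe indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        agree = sum(human[i] == model[i] for i in idx)
        scores.append(agree / n)
    return scores


# Hypothetical toy labels for one binary concept (e.g. Yes/No for "presenting").
human_labels = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]
model_labels = ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"]

scores = bootstrap_alignment(human_labels, model_labels)
mean_score = sum(scores) / len(scores)
print(f"mean bootstrap agreement: {mean_score:.3f}")
```

Running the same function once per prompt configuration and concept would give one score distribution per cell, which is the kind of comparison the Figure 2 caption describes.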