Towards Multimodal Large-Language Models for Parent-Child Interaction: A Focus on Joint Attention

Weiyan Shi; Viet Hai Le; Kenny Tsu Wei Choo

TL;DR

This study probes whether Multimodal Large-Language Models (MLLMs) can detect and reason about joint attention in parent-child interactions, a cornerstone of early language development. Using a dataset of 26 publicly sourced videos annotated by two Speech-Language Pathologists (SLPs), the authors evaluate three MLLMs (GPT-4o, Gemini 1.5 Flash, Video-ChatGPT) against fine-grained temporal ground truth, employing metrics such as $mIoU$ and $R@m$, as well as time-sensitive and eye-contact-related descriptors. Findings show that current MLLMs struggle to interpret joint attention accurately because they integrate child eye-contact cues poorly. GPT-4o performs best on timing and description but is still hampered by eye-contact challenges. The work highlights the need for explicit gaze information and more diverse datasets to advance multimodal reasoning for parent-child interaction analysis, and points to its potential to inform speech-language therapy tools.
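To make the timing metrics concrete, the sketch below computes temporal IoU, $mIoU$, and $R@m$ under the definitions commonly used in temporal grounding: $mIoU$ is the mean IoU over segments, and $R@m$ is the fraction of segments whose IoU is at least the threshold $m$. The summary does not spell out how predicted segments are matched to the annotated ones, so the one-to-one pairing here is an assumption, and the function names and example segments are hypothetical.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds). Assumption: each predicted
# segment is paired one-to-one with an SLP-annotated segment.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], truths: List[Segment]) -> float:
    """mIoU: average temporal IoU over all paired segments."""
    ious = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(ious) / len(ious) if ious else 0.0


def recall_at(preds: List[Segment], truths: List[Segment], m: float) -> float:
    """R@m: share of paired segments whose IoU is at least m."""
    ious = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(iou >= m for iou in ious) / len(ious) if ious else 0.0


# Hypothetical segments, for illustration only (not data from the study).
example_preds = [(0.0, 10.0), (5.0, 15.0)]
example_truths = [(5.0, 15.0), (5.0, 15.0)]

if __name__ == "__main__":
    print(mean_iou(example_preds, example_truths))
    print(recall_at(example_preds, example_truths, 0.5))
```

A higher threshold $m$ makes $R@m$ stricter about boundary alignment, which is why it is usually reported at several thresholds alongside $mIoU$.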

Abstract

Joint attention is a critical component of early speech-language development and a key indicator of effective parent-child interaction. However, research on detecting and analysing joint attention remains limited, particularly for Multimodal Large Language Models (MLLMs). This study evaluates MLLMs' ability to comprehend joint attention by analysing 26 parent-child interaction videos annotated by two speech-language pathologists. These annotations identify strong and poor joint attention segments, serving as benchmarks for evaluating the models' interpretive capabilities. Our findings reveal that current MLLMs struggle to accurately interpret joint attention due to a lack of nuanced understanding of child-initiated eye contact, a crucial component of joint attention dynamics. This study highlights the importance of incorporating detailed eye contact to enhance MLLMs' multimodal reasoning. Addressing these gaps is essential for future research to advance the use of MLLMs in analysing and supporting parent-child interactions.
