Towards Aligning Multimodal LLMs with Human Experts: A Focus on Parent-Child Interaction

Weiyan Shi, Kenny Tsu Wei Choo

TL;DR

This study asks whether multimodal LLMs can align with speech-language pathologists in analysing joint attention during parent–child interactions. It proposes a two-stage framework: first, describe observable behaviours using expert-informed prompts focused on gaze, action, and vocalisation; then, judge interaction quality using few-shot examples derived from expert practices. Observational alignment achieved high accuracy (around 0.86–0.88) across behaviour categories, while judgement alignment remained modest (accuracy ≈ 0.57, Cohen’s κ ≈ 0.18), with performance hampered by data imbalance and subjective interpretive differences. The findings show the feasibility of expert-aligned observation in MLLMs and highlight the challenges of encoding interpretive judgement, offering concrete design implications for human–AI alignment in socially situated AI tools and guidance for future, larger-scale studies.

Abstract

While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.
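
The two-stage separation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `model` callable, the `fake_model` stand-in, and the few-shot examples are all hypothetical and exist only to show how observation can be kept apart from judgement. The rating labels (Strong, Moderate, Poor) are the categories named in the Figure 4 caption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stage-1 prompt: ask only for observable cues, no evaluation.
OBSERVE_PROMPT = (
    "Describe only what is observable in this segment: the gaze, action, "
    "and vocalisation of the parent and the child. Do not judge quality."
)

# Assumed interface: model(prompt, segment) -> text. Any real MLLM client
# would need to be wrapped to match this signature.
Model = Callable[[str, Optional[str]], str]


def build_judgement_prompt(observation: str,
                           examples: Sequence[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from (observation, rating) pairs."""
    parts = ["Rate joint attention as Strong, Moderate, or Poor."]
    for example_observation, rating in examples:
        parts.append(f"Observation: {example_observation}\nRating: {rating}")
    parts.append(f"Observation: {observation}\nRating:")
    return "\n\n".join(parts)


def two_stage(segment: str, model: Model,
              examples: Sequence[Tuple[str, str]]) -> Tuple[str, str]:
    """Stage 1 describes the segment; stage 2 judges the description only."""
    observation = model(OBSERVE_PROMPT, segment)
    rating = model(build_judgement_prompt(observation, examples), None)
    return observation, rating.strip()


# Illustrative few-shot examples (invented, not taken from the paper).
FEW_SHOT = [
    ("Child looks at the toy, then at the parent; parent names the toy.", "Strong"),
    ("Child plays alone; parent looks at a phone and does not speak.", "Poor"),
]


def fake_model(prompt: str, segment: Optional[str]) -> str:
    """Stand-in for an MLLM so the sketch runs without network access."""
    if segment is not None:
        return "Child points at the book; parent follows the point and labels it."
    return " Strong"


observation, rating = two_stage("segment-01", fake_model, FEW_SHOT)
```

Because the second call receives only the text produced by the first, disagreements with experts can be attributed to either the observation step or the judgement step, which mirrors the distinction the abstract draws between the two layers.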

Paper Structure

This paper contains 38 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Flow chart of the video screening process
  • Figure 2: Examples from three categories in our dataset: behavioural guidance, language development, and daily life interaction.
  • Figure 3: Our video annotation tool supports SLPs’ judgement process with four main components: Video Playback Area for watching, pausing, and replaying parent–child interactions (shown here with a screenshot from Video 24); Timeline Annotation Area for selecting and labelling segments as strong or poor joint attention; Note-Taking Area for recording justifications or observations; and Control Button Area for task submission and navigation.
  • Figure 4: Radar plots comparing zero-shot (red) and many-shot (blue) prompting across overall and per-class evaluation metrics. Many-shot prompting consistently improves accuracy, macro-F1, and Cohen’s $\kappa$ in the overall condition, while also boosting performance on the Strong and Moderate categories. Performance on the Poor category remains weak under both conditions, reflecting its limited representation in the dataset.
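
The three metrics named in the Figure 4 caption (accuracy, macro-F1, and Cohen's κ) can be computed from paired expert and model labels as in the sketch below. The label lists are invented for illustration and are not data from the paper; the function names are likewise assumptions. The example is chosen so that one class is under-represented, which shows how a rare class can pull macro-F1 and κ well below raw accuracy.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items on which the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement from the marginal label frequencies of each rater.
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented labels for illustration only; "Poor" appears once and is missed.
expert = ["Strong", "Strong", "Moderate", "Moderate",
          "Poor", "Strong", "Moderate", "Strong"]
model = ["Strong", "Moderate", "Moderate", "Strong",
         "Moderate", "Strong", "Moderate", "Strong"]

acc = accuracy(expert, model)
kappa = cohen_kappa(expert, model)
f1 = macro_f1(expert, model)
```

In this toy example the single missed Poor item contributes an F1 of zero for its class, so macro-F1 falls below accuracy even though most labels agree.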