Towards Aligning Multimodal LLMs with Human Experts: A Focus on Parent-Child Interaction
Weiyan Shi, Kenny Tsu Wei Choo
TL;DR
This study addresses whether multimodal LLMs can align with speech-language pathologists in analysing joint attention during parent–child interactions. It proposes a two-stage framework: first, describe observable behaviours using expert-informed prompts focused on gaze, action, and vocalisation; then, judge interaction quality using few-shot examples derived from expert practices. Observational alignment achieved high accuracy (around 0.86–0.88) across behaviour categories, while judgement alignment remained modest (accuracy ≈ 0.57, Cohen’s κ ≈ 0.18), with performance hampered by data imbalance and subjective interpretive differences. The findings show the feasibility of expert-aligned observation in MLLMs and highlight the challenges of encoding interpretive judgement, offering concrete design implications for human–AI alignment in socially situated AI tools and guidance for future, larger-scale studies.
Abstract
While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.
