Human-AI Alignment of Multimodal Large Language Models with Speech-Language Pathologists in Parent-Child Interactions

Weiyan Shi, Kenny Tsu Wei Choo

TL;DR

This work tackles the challenge of aligning multimodal LLMs with expert SLP reasoning to assess joint attention in everyday parent–child interactions. It introduces a three-dimensional cue framework (gaze, action, vocalisation) and a two-stage pipeline: the first stage extracts fine-grained behavioural descriptions from video segments, and the second judges interaction quality using expert-aligned prompts. Evaluated on 26 naturalistic videos, the approach achieves up to 0.85 segment-level accuracy for cue extraction and over 0.75 average precision in simulating expert judgements, with few-shot prompting outperforming reasoning-based prompts for alignment. The study provides design guidelines for building scalable, developmentally aware, and parent-inclusive AI systems to support observation and reflective analysis of early social-communication behaviours.

Abstract

Joint attention is a critical marker of early social-communicative development, yet it remains difficult for caregivers to assess without expert guidance. In this work, we explore how multimodal large language models (MLLMs) can be aligned with the reasoning processes of speech-language pathologists (SLPs) to support the interpretation of everyday parent-child interactions. We conducted in-depth interviews and video annotation studies with three experienced SLPs to uncover how they evaluate joint attention based on three core behavioural cues: gaze, action, and vocalisation. Using these insights, we developed a two-stage MLLM-based system that first extracts fine-grained behavioural descriptions from video segments and then judges joint attention quality using expert-aligned prompts. Our evaluation across 26 parent-child interaction videos shows that MLLMs can achieve up to 85% accuracy in perceptual cue extraction and over 75% average precision in simulating expert judgement. We further propose design guidelines for building MLLM-based behaviour observation-judgement systems that align with SLPs, emphasising the structuring of behavioural cues, the construction of exemplar libraries grounded in expert annotations, and the need to personalise system responses based on developmental stage and neurotypical or atypical presentation. This work provides structured behavioural cues derived from SLP expertise, demonstrates the feasibility of aligning MLLMs with SLPs' observation and judgement, and offers practical design guidelines for building aligned systems to support parent-child interaction analysis.

Paper Structure

This paper contains 35 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Examples from three categories in our dataset: behavioural guidance, language development, and daily life interaction.
  • Figure 2: Our video annotation tool supports SLPs’ judgement process with four main components: Video Playback Area for watching, pausing, and replaying parent–child interactions; Timeline Annotation Area for selecting and labelling segments as strong or poor joint attention; Note-Taking Area for recording justifications or observations; and Control Button Area for task submission and navigation.
  • Figure 3: Radar plots showing model performance across all three SLPs. Each subplot compares four model configurations (zero-shot/few-shot × reasoning/non-reasoning) across the three joint attention labels (Strong, Moderate, Poor), using precision, recall, and F1-score as axes. Each line represents a different model; larger areas indicate better alignment with expert labels.
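
The per-label precision, recall, and F1-score plotted in Figure 3 can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `per_label_scores`, the one-vs-rest treatment of each label, and the toy label sequences are assumptions made for illustration only. The sketch compares one SLP's segment labels with a model's labels for the same segments.

```python
from collections import Counter

# The three joint attention labels named in the Figure 3 caption.
LABELS = ("Strong", "Moderate", "Poor")


def per_label_scores(expert, model, labels=LABELS):
    """Compute one-vs-rest precision, recall, and F1 for each label.

    `expert` and `model` are equal-length sequences of labels, one entry
    per video segment (hypothetical input format).
    """
    if len(expert) != len(model):
        raise ValueError("expert and model must label the same segments")

    tp, fp, fn = Counter(), Counter(), Counter()
    for e, m in zip(expert, model):
        if e == m:
            tp[e] += 1          # model agrees with the expert
        else:
            fp[m] += 1          # model predicted m, expert did not
            fn[e] += 1          # expert said e, model missed it

    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    expert = ["Strong", "Strong", "Moderate", "Poor", "Poor", "Poor"]
    model = ["Strong", "Moderate", "Moderate", "Poor", "Poor", "Strong"]
    for label, s in per_label_scores(expert, model).items():
        print(label, {k: round(v, 2) for k, v in s.items()})
```

Under this one-vs-rest assumption, each radar subplot would correspond to running such a comparison once per model configuration against a single SLP's labels.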