Multimodality and Attention Increase Alignment in Natural Language Prediction Between Humans and Computational Models

Viktor Kewenig, Andrew Lampinen, Samuel A. Nastase, Christopher Edwards, Quitterie Lacome DEstalenx, Akilles Rechardt, Jeremy I Skipper, Gabriella Vigliocco

TL;DR

The results indicate that improved modeling of naturalistic language processing in mAI does not merely depend on training diet but can be driven by multimodality in combination with attention-based architectures.

Abstract

The potential of multimodal generative artificial intelligence (mAI) to replicate human grounded language understanding, including the pragmatic, context-rich aspects of communication, remains to be clarified. Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words. Correspondingly, multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities. To test whether these processes align, we tasked both human participants (N = 200) as well as several state-of-the-art computational models with evaluating the predictability of forthcoming words after viewing short audio-only or audio-visual clips with speech. During the task, the model's attention weights were recorded and human attention was indexed via eye tracking. Results show that predictability estimates from humans aligned more closely with scores generated from multimodal models vs. their unimodal counterparts. Furthermore, including an attention mechanism doubled alignment with human judgments when visual and linguistic context facilitated predictions. In these cases, the model's attention patches and human eye tracking significantly overlapped. Our results indicate that improved modeling of naturalistic language processing in mAI does not merely depend on training diet but can be driven by multimodality in combination with attention-based architectures. Humans and computational models alike can leverage the predictive constraints of multimodal information by attending to relevant features in the input.

Paper Structure (33 sections, 4 figures)

Figures (4)

  • Figure 1: Unimodal Methods: (First column) For the first probability measure, both the incoming dialogue (input) and all labels in the movie (a) are encoded by CLIP's text encoder in the 'text only' version of the model (b). Predictability is derived as the softmaxed similarity scores (over all labels) between the upcoming label and the resulting encodings (c). (Second column) For the second probability measure, the textual input (a) is fed directly to LLaMA (b). Predictability is derived by pulling out the next-word logits from the model's forward method for all labels in the movie and applying a softmax over them to obtain a probability distribution (c). (Third column) For the prompt-based measure, the textual input is combined with a prompt asking to estimate the predictability of the upcoming word, which is processed by the model (b) and results in a direct prompt-based output measure (c). While GPT-4 and LLaMA are from the same model family, GPT-4 has many more parameters than LLaMA. (Fourth column) For the human measure, instructions are presented to human participants (similar to the prompt used in the prompt-based measure) (a) before they listen to an audio clip (b) and provide predictability estimates on a Likert scale (c) from 0 (Low Relevance) to 100 (High Relevance). Multimodal Methods: (First column) For the first direct probability measure, both the incoming dialogue (input) and all labels in the movie (d) are encoded by CLIP's text encoder. The visual information (frame-by-frame) is encoded by the visual transformer backbone (we used both the 'ViT-32' and the 'RN50' versions of the model) (e). Predictability is derived as the softmaxed similarity scores (over all labels in the movie) between the upcoming label and the resulting multimodal encodings (f). (Second column) For the second direct probability measure, visual input was fed frame-by-frame to the adapter layer. Textual input was fed to the LLaMA model directly (d). Both text and visual information were then processed by the model (e). Predictability scores were derived as the softmaxed next-word logits for all labels in the movie's dialogue (f). (Third column) For the prompt-based measure, the visual input was fed as a GIF to the GPT-4 API, together with a prompt (d). This input was processed by the model with the temperature parameter set to zero (e). Predictability was the direct, deterministic outcome following the prompt (f). (Fourth column) For the human measure, human participants received instructions similar to the prompt fed to GPT-4 in the prompt-based measure (d). Humans then watched the 6 s video clip (e) while their eye movements were tracked through their webcam. Participants indicated relevance on a Likert scale from 0 (Low Relevance) to 100 (High Relevance) (f).
  • Figure 2: Results comparing unimodal and multimodal model predictability scores with human predictability scores. (a) Average human response per audio-visual (multimodal) and audio-only (unimodal) stimulus (Y-axis) plotted against model response (X-axis) for unimodal and multimodal GPT-4, LLaMA, and CLIP. For all multimodal models, the predicted model response (black regression line, only displayed for significant predictions) aligns significantly more with human predictability estimates compared to their unimodal counterpart. (b) Comparison between Pearson correlations of predictability scores derived from multimodal and unimodal models and human predictability estimates. The human ceiling is depicted as a dashed green line. Unimodal LLaMA scores were only marginally correlated with human predictability estimates based on audio-clips (light blue bar), whereas multimodal LLaMA scores were positively correlated with human predictability estimates from video-clips (including speech) (dark blue bar). Similarly, scores based on unimodal GPT-4 were more weakly correlated with human predictability estimates from textual information (light pink bar) compared to multimodal GPT-4 scores correlated with human predictability estimates from video-clips (dark pink bar). Finally, scores extracted from the text-only CLIP model (light orange) were marginally correlated with human predictability, while there was a high positive correlation between human predictability scores and the multimodal CLIP model (dark orange). Error bars represent 95% confidence intervals. Stars indicate significance.
  • Figure 3: Results for comparing predictability scores in the top quartile of model predictability for CLIP with (ViT-32) and without (RN50) attention. (a) Average human response per video (Y-axis) plotted against model response (X-axis) for the CLIP model with and without attention. The top quartile of model scores is surrounded by a dashed grey box. A zoomed-in scatterplot displays human-model prediction alignment only for this top quartile (on the right). Regression lines are only displayed for significant predictions. Stars indicate strength of significance. (b) Bar plot comparing correlations of predictability scores for the respective top quartiles derived from multimodal and unimodal versions of both LLaMA and GPT-4 with human predictability estimates. The human ceiling is displayed as a green dashed line. LLaMA scores were the lowest (dark blue), followed by GPT-4 (dark pink), and CLIP-RN50 (yellow). The highest score by far was obtained from the CLIP-ViT32 version with a visual attention mechanism (dark orange). Error bars indicate 95% confidence intervals.
  • Figure 4: Alignment between human gaze and model attention. (a) Layerwise percentage of alignment (model-human correlation divided by human-human correlation). Average alignment for all segments per layer is displayed as a blue line (with blue dots). Average alignment per layer for all segments in video-clips that are in the top quartile of human predictability estimates is displayed as an orange line with orange crosses. Average alignment per layer for all segments in video-clips that are in the top quartile of model predictability estimates is displayed as a green line with green squares. Average alignment per layer for all segments in video-clips that are in the top quartile of within-human eye tracking alignment is displayed as a red line with red crosses. Overall, average overlap for all videos and overlaps based on predictability estimates follow similar trends. Layer three reaches between 40% and 50% of the human ceiling. Attention patches from layers nine and ten are most correlated with human eye tracking, reaching 70% of the human ceiling in those videos where human predictability or model predictability was high. However, when human attention matrices are highly correlated, model attention patches are also highly correlated with human eye tracking, reaching 80% and almost 90% of the human ceiling in layers 9 and 10, respectively. (b) Analysis of whether presence/absence of the referent influenced correlation scores for layer 9. Using human ratings collected from 100 participants (red dots), absence or presence of the referent significantly predicted correlation value in a linear regression model (with random intercept for participant ID). In particular, the correlation between human eye tracking and model attention patterns was, on average, 0.2339 greater when the referent was present. (c) One example of evolving alignment in layer 9 between model attention patterns (red) and human eye tracking (green) over one video clip, separated into 15 segments. When salient visual information was present (segments 2, 3, and 4), correlation was positive. However, when the referent was not present, correlation dipped into negative values (e.g. segment 5, r = -0.16).
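
The captions above mention three quantities that can be sketched in a few lines: a softmax over per-label scores (Figure 1), a Pearson correlation between model and human scores (Figure 2), and a model-human correlation expressed as a fraction of the human-human ceiling (Figure 4). The following is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the numbers are made-up toy values rather than data from the paper, and the way the paper computes its human ceiling is not specified here.

```python
import numpy as np


def label_predictability(scores, target_index):
    """Softmax over per-label scores; return the probability of the target label.

    `scores` stands in for similarity scores or next-word logits,
    one entry per candidate label (hypothetical input format).
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs[target_index])


def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def ceiling_normalized_alignment(model_human_r, human_human_r):
    """Model-human correlation divided by human-human correlation."""
    if human_human_r <= 0:
        raise ValueError("human-human correlation must be positive")
    return model_human_r / human_human_r


# Toy values for illustration only (NOT results from the paper).
toy_label_scores = [2.0, 1.0, 0.1]   # one score per candidate label
p = label_predictability(toy_label_scores, target_index=0)

toy_model = [0.10, 0.40, 0.35, 0.80]  # model predictability per clip
toy_human = [10.0, 35.0, 50.0, 90.0]  # mean human rating per clip (0-100)
r = pearson_r(toy_model, toy_human)

# Assumed ceiling value, chosen arbitrarily for the example.
alignment = ceiling_normalized_alignment(r, human_human_r=0.9)
print(p, r, alignment)
```

Dividing by the human-human correlation puts model alignment on a scale where 1.0 means the model agrees with people as well as people agree with each other, which is how the percentages of the human ceiling in Figure 4 can be read.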