"Yeah Right!" -- Do LLMs Exhibit Multimodal Feature Transfer?

Benjamin Reichman, Kartik Talamadupula

TL;DR

The paper investigates whether multimodal features learned from speech or human-to-human conversations transfer to unimodal text tasks, focusing on covert deceptive communication as a testbed. It compares GPT-4o (text+image+audio) with GPT-4 Turbo (text+image) and Llama-2-70B-chat with its conversationally tuned variant, across basic versus speech-conversational prompts and prompting strategies. Results indicate that multimodal models can outperform unimodal baselines on deception detection with basic prompting, while chain-of-thought prompting can shift or reverse this advantage for some model pairs. The work provides evidence for cross-modal feature transfer, informs decisions about model deployment for unimodal tasks, and highlights limitations related to unavailable architectures and small, subjective datasets.

Abstract

Human communication is a multifaceted and multimodal skill. Communication requires an understanding of both the surface-level textual content and the connotative intent of a piece of communication. In humans, learning to go beyond the surface level starts by learning communicative intent in speech. Once humans acquire these skills in spoken communication, they transfer those skills to written communication. In this paper, we assess the ability of speech+text models, and of text models trained with special emphasis on human-to-human conversations, to make this multimodal transfer of skill. We specifically test these models on their ability to detect covert deceptive communication. We find that, with no special prompting, speech+text LLMs have an advantage over unimodal LLMs in performing this task. Likewise, we find that human-to-human conversation-trained LLMs are also advantaged in this skill.
Paper Structure (13 sections, 14 tables)