MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery

Feihong Lu, Weiqi Wang, Yangyifei Luo, Ziqin Zhu, Qingyun Sun, Baixuan Xu, Haochen Shi, Shiqi Gao, Qian Li, Yangqiu Song, Jianxin Li

TL;DR

MIKO addresses the challenge of inferring social-media users' intentions from multimodal posts by distilling intention knowledge through a staged LLM/MLLM pipeline. It combines image captioning, key-information extraction, and intention generation aligned with ATOMIC relations to produce an extensive intention knowledge base from 137,287 posts (1,372K intentions). Intrinsic evaluations with two-stage human annotation demonstrate high plausibility and typicality, while extrinsic evaluation shows that incorporating distilled intentions enhances sarcasm detection beyond strong baselines. The approach demonstrates the value of multimodal reasoning for social-intention understanding and offers a scalable framework for downstream social-media analytics and cognition-guided NLP tasks.

Abstract

Social media has become a ubiquitous tool for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging for three reasons: intentions are often implicit, interpreting a post requires cross-modality understanding of both text and images, and posts contain noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, we use an MLLM to interpret the image, use an LLM to extract key information from the text, and finally instruct the LLM again to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. We conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation. We further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.
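
The three-stage flow described in the abstract (image description, key-information extraction, intention generation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Post`, `IntentionRecord`, `distill_intentions`, the prompt wording, and the two model callables are all hypothetical names introduced here, and the relation labels passed in are placeholders rather than the set MIKO actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Post:
    """A social media post with one attached image (hypothetical container)."""
    text: str
    image_path: str


@dataclass
class IntentionRecord:
    """Outputs of the three stages for a single post (hypothetical container)."""
    image_description: str
    key_information: str
    intentions: Dict[str, str]


def distill_intentions(
    post: Post,
    describe_image: Callable[[str], str],  # assumed MLLM wrapper: image path -> description
    query_llm: Callable[[str], str],       # assumed LLM wrapper: prompt -> completion
    relations: List[str],                  # placeholder relation labels
) -> IntentionRecord:
    # Stage 1: the MLLM interprets the image.
    image_description = describe_image(post.image_path)

    # Stage 2: the LLM extracts key information from the noisy post text.
    key_information = query_llm(
        "Extract the key information from this social media post, "
        f"ignoring hashtags and typos:\n{post.text}"
    )

    # Stage 3: the LLM is instructed again, once per relation, using the
    # outputs of the two earlier stages as context.
    intentions: Dict[str, str] = {}
    for relation in relations:
        prompt = (
            f"Post: {post.text}\n"
            f"Image description: {image_description}\n"
            f"Key information: {key_information}\n"
            f"State the user's intention for the relation '{relation}'."
        )
        intentions[relation] = query_llm(prompt)

    return IntentionRecord(image_description, key_information, intentions)
```

Passing the models in as plain callables keeps the sketch independent of any particular MLLM or LLM client; in practice each callable would wrap whatever API or local model is available.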

Paper Structure (24 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Examples of users' intentions in their social media posts. User 1’s intention is to buy a cost-effective iPhone, while User 2’s intention is to express disappointment with the performance of the young Lakers players.
  • Figure 2: The overall architecture of our work, which encompasses three core components: multi-information reasoning, intention distillation, and multi-view intention effectiveness evaluation. We leverage the LLaVA and ChatGPT models, employing a novel hierarchical prompt guidance approach to extract the image description (Section 4.1), key information (Section 4.2), and intentions (Section 4.3) from user posts. Following this, we annotate the derived intentions based on rationality and credibility, create a benchmark (Section 5.1), and assess both the performance of various LLMs (Section 5.3) and the gains that intentions bring to the sarcasm detection task (Section 5.4).
  • Figure 3: An example illustrating the generated image description, key information, and intentions. "P" stands for plausibility and "T" stands for typicality. Generated tails of good quality (in green) and bad quality (in red) are highlighted. "H" and "L" indicate high and low plausibility and typicality scores, respectively.
  • Figure 4: Average typicality score of each aspect of intentions. The vertical axis represents the proportion of three different categories within manually annotated intentions, while the horizontal axis displays ten different aspects of intentions.
  • Figure 5: An example illustrating the instructions for intention generation.