
A Multimodal Social Agent

Athina Bikaki, Ioannis A. Kakadiaris

TL;DR

MuSA addresses the challenge of social content analysis by embedding reasoning, planning, optimization, critique, refinement, and action within a modular, model-agnostic agent. It operates in a closed static environment and uses a planning loop that maps tasks to actionable sequences via equations such as $s_0=\pi(E, t, \Theta, p^{rsn})$, $o_i=...$, $s_{i+1}=...$ to drive decisions, with optimization performed by TextGrad using a loss $\mathcal{L}$ and update $x^{rsn} \leftarrow x^{rsn} - \alpha \frac{\partial \mathcal{L}}{\partial x^{rsn}}$. It leverages chain-of-thought prompting and self-reflection, as well as external tools for semantic memory, to improve reasoning and reduce hallucinations. Empirically, MuSA improves QA, VQA, title generation, and categorization over baselines on HotpotQA, WikiWeb2M, and MN-DS datasets; ablation studies reveal dependencies on model choice, reasoning strategy, and multimodal inputs. This modular, extensible approach offers practical benefits for social listening and decision-support applications by enabling task-specific configurations and cost-optimized deployments.
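The update $x^{rsn} \leftarrow x^{rsn} - \alpha \frac{\partial \mathcal{L}}{\partial x^{rsn}}$ has the same form as an ordinary gradient-descent step. The sketch below shows only that numeric analog, as an aid to reading the notation. It is not MuSA's or TextGrad's implementation: in TextGrad the "gradient" is natural-language feedback produced by an LLM and the variable being updated is text. The names `toy_loss`, `numeric_gradient`, and `gradient_step` are hypothetical, and the scalar `x` merely stands in for $x^{rsn}$.

```python
def numeric_gradient(loss, x, eps=1e-6):
    """Central-difference estimate of d(loss)/dx at x."""
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)


def gradient_step(loss, x, alpha):
    """One update of the form x <- x - alpha * dL/dx."""
    return x - alpha * numeric_gradient(loss, x)


def toy_loss(x):
    """Hypothetical scalar loss with its minimum at x = 3 (assumption, for illustration)."""
    return (x - 3.0) ** 2


x = 0.0  # stand-in for the variable being optimized
for _ in range(50):
    x = gradient_step(toy_loss, x, alpha=0.1)

print(f"x after 50 steps: {x:.4f}")
```

Each step moves `x` against the gradient of the loss, scaled by the step size $\alpha$; in the textual setting, the analogous step is a rewrite guided by critique rather than a subtraction.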

Abstract

In recent years, large language models (LLMs) have demonstrated remarkable progress in common-sense reasoning tasks. This ability is fundamental to understanding social dynamics, interactions, and communication. However, the potential of integrating computers with these social capabilities is still relatively unexplored. This paper introduces MuSA, a multimodal LLM-based agent that analyzes text-rich social content tailored to address selected human-centric content analysis tasks, such as question answering, visual question answering, title generation, and categorization. It uses planning, reasoning, acting, optimizing, criticizing, and refining strategies to complete a task. Our approach demonstrates that MuSA can automate and improve social content analysis, helping decision-making processes across various applications. We have evaluated our agent's capabilities in question answering, title generation, and content categorization tasks. MuSA performs substantially better than our baselines.

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 8 tables, and 1 algorithm.

Figures (8)

  • Figure 1: An example for MuSA. The post was selected from X [world_health_organization_who_who_delegates_2024].
  • Figure 2: Visualization of MuSA plan and action execution for our selected content analysis tasks (T). The similarity between the responses of the planner and optimizer is assessed using Jensen-Shannon divergence (JSD). MuSA available units (B).
  • Figure 3: Role assignment for a task.
  • Figure 4: Decision plan of a task. Plan A was proposed by the planner, and plan B by the optimizer.
  • Figure 5: An example of a multiple-choice question from a quiz related to public health surveillance [cdc_principles_2023]. A single trial with different roles within the same environment and task.
  • ...and 3 more figures
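
The caption of Figure 2 notes that the planner's and optimizer's responses are compared using Jensen-Shannon divergence (JSD). The sketch below shows the standard JSD between two discrete distributions, in base 2 so the value lies in [0, 1]. How MuSA turns responses into distributions is not stated in this excerpt; the bag-of-words step (`word_distribution`) and the two example plan strings are assumptions made purely for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def word_distribution(text, vocab):
    """Hypothetical helper: relative word frequencies of `text` over `vocab`."""
    words = text.lower().split()
    counts = [words.count(w) for w in vocab]
    total = sum(counts)
    if total == 0:
        raise ValueError("text shares no words with the vocabulary")
    return [c / total for c in counts]


# Illustrative strings only; these are not plans taken from the paper.
plan_a = "retrieve context then answer the question"
plan_b = "retrieve context then critique then answer"

vocab = sorted(set(plan_a.split()) | set(plan_b.split()))
p = word_distribution(plan_a, vocab)
q = word_distribution(plan_b, vocab)

jsd = jensen_shannon_divergence(p, q)
print(f"JSD between the two plans: {jsd:.3f}")
```

A value of 0 means the two distributions are identical, and a value of 1 means they share no support. Unlike KL divergence, JSD is symmetric and always finite, which makes it convenient for comparing two responses without choosing one as the reference.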