Teaching Human Behavior Improves Content Understanding Abilities Of LLMs

Somesh Singh, Harini S, Yaman K Singla, Veeky Baths, Rajiv Ratn Shah, Changyou Chen, Balaji Krishnamurthy

TL;DR

This work shows that training LLMs to predict receiver behavior, in the form of likes and comments, improves their performance on a wide variety of downstream content-understanding tasks. It also releases cleaned receiver-behavior data (comments and likes) for 750k images and videos collected from multiple platforms, along with instruction-tuning data.

Abstract

Communication is defined as "Who says what to whom with what effect". A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Despite carrying these signals, behavior data is often ignored when training large language models. We show that training LLMs on receiver behavior can improve their content-understanding abilities. Specifically, we show that training LLMs to predict receiver behavior, in the form of likes and comments, improves their performance on a wide variety of downstream content-understanding tasks. We show this performance increase on 46 video and image understanding tasks across 26 benchmark datasets in both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotation to be useful, the performance improvement from training on this data is essentially a free lunch. We release cleaned receiver-behavior data (comments and likes) for 750k images and videos collected from multiple platforms, along with our instruction-tuning data.

Paper Structure (14 sections, 13 figures, 11 tables)

Figures (13)

  • Figure 1: The diagram depicts the five factors of communication in the context of an example YouTube video (https://www.youtube.com/watch?v=eT8hO4e2iTM) and shows where the free lunch lies. The receiver effect is not used when training large vision and language models, yet it contains many important signals that can help in understanding the content. The figure shows several comments containing temporal, cognitive, character, context, and user-opinion information useful for understanding the video.
  • Figure 2: Behavior-LLaVA is trained to answer behavioral questions, such as simulating user comments and likes on a video. Once trained, the model outperforms LLaMA-VID and other VLMs on content-related tasks like emotion recognition, action recognition, question answering, persuasion strategy classification, etc. The original video was showcased at Super Bowl 2024 and is posted on YouTube at https://www.youtube.com/watch?v=OU7BJc96lI4. The video is titled "Perfect 10: The Kia big game commercial featuring the 2024 Kia EV9" by Kia America.
  • Figure 3: Behavior-LLaVA achieves much higher zero-shot performance than Ad-LLaVA and the base model LLaMA-VID across a diverse suite of image, video, and audio benchmarks.
  • Figure 4: Behavior Instruction fine-tuning template for the video: https://www.youtube.com/watch?v=BKPQkjRF4yY
  • Figure 5: Dense caption generated by Behavior-LLaVA for the video of a Volkswagen ad. The original video is posted at https://www.youtube.com/watch?v=kyuGXPNr-T0. The red-colored text highlights the most important aspects of the video captured by Behavior-LLaVA, demonstrating an understanding of aesthetics, characters, world knowledge, emotion, and spatial relationships. More such examples are given in Figs. \ref{fig:qualitative-image-1}, \ref{fig:qualitative-image-2}, \ref{fig:qualitative-image-3} for images and Figs. \ref{fig:qualitative-video-1}, \ref{fig:qualitative-video-2} for videos.
  • ...and 8 more figures