Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya, Yaman Kumar Singla, Sudhir Yarram, Somesh Kumar Singh, Harini S, James Z. Wang

TL;DR

ToT2MeM introduces a large-scale unsupervised dataset for visual memorability signals derived from Tip-of-the-Tongue recall queries, enabling descriptive recall generation and multimodal ToT retrieval. It provides 470k content-recall pairs and a video subset of 82,500 videos with OCR and transcripts, bridging open-ended recall with video content. Fine-tuning lightweight vision-language models on ToT2MeM yields strong gains over baselines for recall generation and enables a competitive ToT retrieval model via contrastive learning. The work demonstrates scalable, cross-domain memorability signals with potential applications in content design and retrieval while acknowledging ethical considerations and data biases.

Abstract

Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
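The contrastive training strategy mentioned above can be illustrated with a generic symmetric InfoNCE objective over matched (recall, content) embedding pairs. This is only a minimal sketch: the abstract does not specify the exact loss, encoders, or temperature, so the function name `symmetric_info_nce`, the temperature value, and the use of plain NumPy arrays as embeddings are all assumptions made for illustration.

```python
import numpy as np


def symmetric_info_nce(recall_emb, content_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumed stand-in, not the paper's exact loss).

    Row i of `recall_emb` and row i of `content_emb` are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    r = recall_emb / np.linalg.norm(recall_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = (r @ c.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the recall-to-content and content-to-recall directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Under this objective, a batch in which each recall embedding is closest to its own content embedding receives a lower loss than the same batch with the pairing shuffled, which is the property a retrieval model relies on at query time.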

Paper Structure

This paper contains 29 sections, 1 equation, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Our complete data collection and task pipeline. We use Tip-of-the-Tongue (ToT) search posts from Reddit (top left) and collect data through a rigorous filtering process. The resulting data points are recall-content pairs. The original Reddit search query becomes the descriptive recall, since it is what a user remembers about the content and then posts on the platform. The correct content is retrieved from the comments on the post. We also create a video-based subset of the data by downloading the raw visual information from YouTube and providing additional details such as audio transcripts and OCR. We propose two tasks using our dataset: Descriptive Recall Generation and Multimodal ToT Retrieval. We also present two models trained on our dataset, ToT2MeM-Recall and ToT2MeM-Retrieval, which generate descriptive memorability recall and perform multimodal retrieval, respectively.
  • Figure 2: (a) Correlation between external popularity (measured using YouTube views) and the number of times the content is searched for on Reddit. Zoomed-in graphs for each group (based on the search count) are included in the Appendix. We also include the analysis using Wikipedia page views as a measure of popularity in the Appendix. (b) Correlation between external popularity (YouTube views) of the content and the time elapsed between the original post about the content and the comment containing the correct answer. (c) Relationship between the popularity of different genres, as measured by average box office collections, and the number of posts found in ToT forums. A more detailed picture of the relationship between genre popularity and searches is presented in the Appendix.
  • Figure 3: (a) Average Response Time (in Hours) for content belonging to each genre. Response time refers to the time elapsed between a post/search being made and the time when the correct answer is provided in the comments. (b) Comparison of searches made for content in each genre, with time (in days) since the release of that content (as obtained from the Wikipedia page creation date).
  • Figure 4: (a) Top 10 Reddit threads used to construct our dataset. (b) Top 10 content types included in our dataset. (c) Top domains to which direct links are present in the dataset; in most cases, the correct content item is referred to using a link to one of these domains. (d) Distribution of emotions in the original Reddit posts, that is, the emotions expressed within the recall signals. The analysis uses the culturally robust HICEM emotion model [wortman2023hicem].
  • Figure 5: An example of the different signals used to identify the valid answer. The thread moderator bot posts a comment, usually pinned, that highlights the correctly solved answer. Note that the thread moderator usually copies the correct answer verbatim into its own comment. The actual solving comment can also be found by tracking replies from the original poster when they directly confirm the answer (in this case, by saying "Solved!").
  • ...and 17 more figures
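
The two signals described in the Figure 5 caption (a confirming reply from the original poster, and a pinned moderator comment) can be sketched as a small heuristic. This is an illustrative sketch only, not the authors' filtering code: the comment schema (`id`, `parent_id`, `author`, `body`, `is_pinned`, `is_moderator`), the function name `find_solving_comment`, and the single "solved" keyword are hypothetical choices.

```python
import re

# Hypothetical confirmation pattern; a word boundary keeps "unsolved" from matching.
CONFIRMATION = re.compile(r"\bsolved\b", re.IGNORECASE)


def find_solving_comment(comments, op_author):
    """Return the comment most likely to contain the correct answer, or None.

    `comments` is a list of dicts with the assumed keys
    id, parent_id, author, body and optionally is_pinned, is_moderator.
    """
    by_id = {c["id"]: c for c in comments}

    # Signal 1: the original poster replies to a comment with a confirmation.
    for c in comments:
        if c["author"] == op_author and CONFIRMATION.search(c["body"]):
            parent = by_id.get(c.get("parent_id"))
            if parent is not None and parent["author"] != op_author:
                return parent

    # Signal 2: fall back to a pinned moderator comment that restates the answer.
    for c in comments:
        if c.get("is_pinned") and c.get("is_moderator"):
            return c

    return None
```

For example, given a made-up thread in which a user suggests a title and the original poster replies "Solved! Thank you", the function returns the suggestion rather than the confirmation itself.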